(b) All the claims are believed to be directed to a single invention. If the
Office determines that all the claims presented are not obviously directed to a single
invention, then Applicants will make an election without traverse as a prerequisite to the
grant of special status.



(c) Pre-examination searches were made of U.S. issued patents, including
a classification search and a computer database search. The searches were performed on or
around September 15, 2004, and were conducted by a professional search firm, Kramer &
Amado, P.C. The classification search covered Class 709 (subclasses 203, 219, and 225),
Class 711 (subclasses 111 and 112), and Class 713 (subclasses 165, 200, 201, and 202) for
the U.S. and foreign subclasses identified above. The computer database search was
conducted on the USPTO systems EAST and WEST. The inventors further provided three
references considered most closely related to the subject matter of the present application (see
references #4-6 below), which were cited in the Information Disclosure Statements filed on
October 9, 2003.

(d) The following references, copies of which are attached herewith, are
deemed most closely related to the subject matter encompassed by the claims:

(1) U.S. Patent No. 6,122,631;

(2) U.S. Patent Publication No. 2002/0103904 A1;

(3) U.S. Patent Publication No. 2003/0023784 A1;

(4) Bakke et al., "iSCSI Naming and Discovery," available on-line
at http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-name-disc-09.txt,
Internet Draft of IPS Working Group, the Internet
Society (2002);

(5) Gibson et al., "File Server Scaling with Network-Attached
Secure Disks," Proceedings of the 1997 ACM Sigmetrics
International Conference on Measurement and Modeling of
Computer Systems, Seattle, WA (1997); and

(6) Vahalia, UNIX Internals: The New Frontiers, pp. 291-313,
Prentice Hall (1995).

(e) Set forth below is a detailed discussion of references which points out 
with particularity how the claimed subject matter is distinguishable over the references. 

A. Claimed Embodiments of the Present Invention 

The claimed embodiments relate to a file server that provides a plurality of 
networked clients with file services. 

Independent claim 1 recites a file server system comprising a plurality of hard 
disk drives connected to a plurality of clients via a network; and a file control unit connected 
to the network for accepting an access request from the clients to the hard disk drives to 
manage the data input/output of the plurality of hard disk drives. The file control unit has 
configuration information with which a plurality of pieces of identification (ID) information, 
each identifying one of the plurality of hard disk drives, can be registered. The file control 
unit broadcasts a hard disk drive search message via the network. In response to the hard 
disk drive search message, the hard disk drive returns the ID information specifying the self 
hard disk drive to the file control unit. In response to the returned ID information, the file 
control unit establishes a setting such that the hard disk drive, which has returned the ID 
information, cannot communicate with devices on the network other than the file control unit. 

Independent claim 11 recites a file server system comprising a plurality of
switching hubs interconnected to form a network; a plurality of hard disk drives connected to 
clients via the network; and a file control unit. Each of the plurality of hard disk drives is 
connected to one of the plurality of switching hubs. The file control unit is connected to one 
of the plurality of switching hubs. The file control unit accepts an access request from the 
clients to the hard disk drives to manage a data input/output of the plurality of hard disk 
drives. The switching hubs perform control so that the file control unit and the plurality of 
clients belong to a virtual network and so that the file control unit and the plurality of hard 
disk drives belong to another virtual network. 

One of the benefits that may be derived is securing the safety of data saved on
hard disk drives when clients and hard disk drives are connected to the same LAN. By
preventing the clients or the management terminal from reading data from, or writing data to,
the hard disk drive without obtaining permission from the file control unit, the system ensures
data safety because data is always transferred via the file control unit. In addition, because
the clients and the management terminal are inhibited from directly accessing the hard disk
drives in the system employing separate virtual networks, the file control unit and the hard
disk drives need not have an encrypted communication function, and thus the cost may be reduced.

B. Discussion of the References 

None of the following references disclose a file server system in which the file 
control unit broadcasts a hard disk drive search message via the network; wherein, in 
response to the hard disk drive search message, the hard disk drive returns the ID information 
specifying the self hard disk drive to the file control unit; and wherein, in response to the 
returned ID information, the file control unit establishes a setting such that the hard disk 
drive, which has returned the ID information, cannot communicate with devices on the
network other than the file control unit. 

The references also do not teach a file server system in which the file control 
unit accepts an access request from the clients to the hard disk drives to manage a data 
input/output of the plurality of hard disk drives, wherein the switching hubs perform control 
so that the file control unit and the plurality of clients belong to a virtual network and so that 
the file control unit and the plurality of hard disk drives belong to another virtual network. 

1. U.S. Patent No. 6,122,631

This reference discloses a dynamic server-managed access control for a 
distributed file system with a server file access token which it delivers to the distributed file 
system and to the client. The client uses the token in place of a standard file name. If a 
request to access (open) the file is received from a client with the token, the file is opened. 

2. U.S. Patent Publication No. 2002/0103904 A1

This reference discloses a method and an apparatus for controlling access to
files associated with a virtual server. Each virtual server is also assigned an identifier to
uniquely identify that server and all files associated with that server. The server computing
device retrieves the identifier assigned to the existing file. Next, the server computing device
determines whether the identifier is associated with the virtual server that generated the file
access request. If the identifier is associated with the virtual server that generated the file
access request, the server computing device allows access to take place.

3. U.S. Patent Publication No. 2003/0023784 A1

This reference relates to a storage system having a plurality of controllers. 
The assigned controller identity number is the identification number of the controller to 
which the disk drive unit identified at path ID and address is allocated. The processor 
searches the disk pool management table using the identification number in the inquiry as the 
search key to identify the disk drive units that can be used by the inquiring controller. See 
[0070] and [0084]. 

4. Bakke et al., "iSCSI Naming and Discovery," available on-line at
http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-name-disc-09.txt,
Internet Draft of IPS Working Group, the Internet Society (2002)

This reference discloses iSCSI which is a standard that allows SCSI protocol 
communication to be performed on a network. 

5. Gibson et al., "File Server Scaling with Network-Attached Secure Disks,"
Proceedings of the 1997 ACM Sigmetrics International Conference on 
Measurement and Modeling of Computer Systems, Seattle, WA (1997) 

This reference discloses NetSCSI and NASD. This is discussed in the present 
specification at page 2, line 14 to page 4, line 21. 

6. Vahalia, UNIX Internals: The New Frontiers, pp. 291-313, Prentice Hall
(1995)

This reference relates to NFS (Network File System), which is a technology 
for managing data on a file basis. A computer in which files are saved is called a file server, 
and a computer that uses the files saved in a file server via a network is called a client. NFS 
is a technology that allows the user to use files saved in the file server as if they were saved in 
the client's disk. In practice, NFS is defined as a network communication protocol between a 
file server and a client. 

(f) In view of this petition, the Examiner is respectfully requested to issue

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html

The list of Internet-Draft Shadow Directories can be accessed at

Copyright Notice

Copyright (C) The Internet Society (2002). All Rights Reserved. 
Abstract 

This document provides examples of iSCSI (SCSI over TCP) name
construction and discussion of discovery of iSCSI resources (targets)
by iSCSI initiators. This document complements the iSCSI protocol
draft. Flexibility is the key guiding principle behind this
document. That is, an effort has been made to satisfy the needs of
both small isolated environments, as well as large environments

1. iSCSI Names and Addresses

The main addressable, discoverable entity in iSCSI is an iSCSI Node.
An iSCSI node can be either an initiator, a target, or both. The
rules for constructing an iSCSI name are specified in [iSCSI].

This document provides examples of name construction that might be 
used by a naming authority. 

Both targets and initiators require names for the purpose of 
identification, and so that iSCSI storage resources can be managed 
regardless of location (address). An iSCSI name is the unique 
identifier for an iSCSI node, and is also the SCSI device name [SAM2] 
of an iSCSI device. The iSCSI name is the principal object used in 
authentication of targets to initiators and initiators to targets.
This name is also used to identify and manage iSCSI storage 
resources. 

Furthermore, iSCSI names are associated with iSCSI nodes instead of
with network adapter cards to ensure the free movement of network
HBAs between hosts without loss of SCSI state information
(reservations, mode page settings etc) and authorization
configuration.

An iSCSI node also has one or more addresses. An iSCSI address
specifies a single path to an iSCSI node and consists of the iSCSI
name, plus a transport (TCP) address which uses the following format:

<domain-name>[:<port>]

Where <domain-name> is one of:

- IPv4 address, in dotted decimal notation. Assumed if the name
contains exactly four numbers, separated by dots (.), where each
number is in the range 0..255.

- IPv6 address, in colon-separated hexadecimal notation, as specified
in [RFC2732] and enclosed in "[" and "]" characters, as specified

- Fully Qualified Domain Name (host name). Assumed if the <domain- 
name> is neither an IPv4 nor an IPv6 address. 

For iSCSI targets, the <port> in the address is optional; if
specified, it is the TCP port on which the target is listening for
connections. If the <port> is not specified, the default port 3260,
assigned by IANA, will be assumed. For iSCSI initiators, the <port>
is omitted.

Examples of addresses:

192.0.2.2
192.0.2.23:5003
[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]
[1080:0:0:0:8:800:200C:417A]
[3ffe:2a00:100:7031::1]
[1080::8:800:200C:417A]
[1080::8:800:200C:417A]:3260
[::192.0.2.5]
mydisks.example.com
moredisks.example.com:5003
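
The address rules above can be illustrated with a short Python sketch. It is only one reading of the format, not code from the draft: the function name `parse_iscsi_address` and the returned tuple layout are hypothetical, and the FQDN branch does no validation of the host name itself.

```python
import ipaddress

DEFAULT_ISCSI_PORT = 3260  # default port named in the text above


def parse_iscsi_address(text):
    """Split '<domain-name>[:<port>]' into (kind, host, port).

    kind is 'ipv4', 'ipv6' or 'fqdn' (hypothetical labels for this sketch).
    """
    text = text.strip()
    if text.startswith("["):
        # Bracketed IPv6 literal, optionally followed by ':<port>'.
        end = text.index("]")  # raises ValueError if the bracket is not closed
        host = text[1:end]
        rest = text[end + 1:]
        ipaddress.IPv6Address(host)  # raises ValueError if malformed
        kind = "ipv6"
    else:
        host, sep, port_text = text.partition(":")
        rest = sep + port_text
        parts = host.split(".")
        # "Exactly four numbers, separated by dots, each in the range 0..255".
        if len(parts) == 4 and all(
            p.isascii() and p.isdigit() and int(p) <= 255 for p in parts
        ):
            kind = "ipv4"
        else:
            kind = "fqdn"  # assumed when neither IPv4 nor IPv6
    if rest == "":
        port = DEFAULT_ISCSI_PORT
    elif rest.startswith(":") and rest[1:].isascii() and rest[1:].isdigit():
        port = int(rest[1:])
    else:
        raise ValueError(f"malformed port in {text!r}")
    return kind, host, port
```

An initiator address would simply omit the port; this sketch still reports the default in that case, which a caller is free to ignore.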


The concepts of names and addresses have been carefully separated in
iSCSI:

- An iSCSI Name is a location-independent, permanent identifier for
an iSCSI node. An iSCSI node has one iSCSI name, which stays
constant for the life of the node. The terms "initiator name" and
"target name" also refer to an iSCSI name.

- An iSCSI Address specifies not only the iSCSI name of an iSCSI
node, but also a location of that node. The address consists of a
host name or IP address, a TCP port number (for the target), and
the iSCSI Name of the node. An iSCSI node can have any number of
addresses, which can change at any time, particularly if they are
assigned via DHCP.

A similar analogy exists for people. A person in the USA might be: 
Robert Smith 

SSN+DateOfBirth: 333-44-5555 14-MAR-1960 
Phone: +1 (763) 555. 1212 

Home Address: 555 Big Road, Minneapolis, MN 55444 
Work Address: 222 Freeway Blvd, St. Paul, MN 55333 

In this case, Robert's globally unique name is really his Social 
Security Number plus Date of Birth. His common name, "Robert Smith", 
is not guaranteed to be unique. Robert has three locations at which 
he may be reached; two Physical addresses, and a phone number. 

In this example, Robert's SSN+DOB is like the iSCSI Name (date of 
birth is required to disambiguate SSNs that have been reused), his 
phone number and addresses are analogous to an iSCSI node's TCP 
addresses, and "Robert Smith" would be a human-friendly label for 
this person. 

To assist in providing a more human-readable user interface for 
devices that contain iSCSI targets and initiators, a target or 
initiator may also provide an alias. This alias is a simple UTF-8 
string, is not globally unique, and is never interpreted or used to 
identify an initiator or device within the iSCSI protocol. Its use is 
described further in section 2. 



1.1. Constructing iSCSI names using the iqn. format 

The iSCSI naming scheme was constructed to give an organizational 
naming authority the flexibility to further subdivide the 
responsibility for name creation to subordinate naming authorities. 
The iSCSI qualified name format is defined in [iSCSI] and contains 
(in order) : 

- The string "iqn."

- A date code specifying the year and month in which the organization
registered the domain or sub-domain name used as the naming
authority string.

- The organizational naming authority string, which consists of a
valid, reversed domain or subdomain name.



Voruganti, et. al.        Expires September 2003        [Page 4]

Internet Draft      iSCSI Naming and Discovery      March 2003

draft-ietf-ips-iscsi-name-disc-09

- Optionally, a ':', followed by a string of the assigning
organization's choosing, which must make each assigned iSCSI name
unique.

The following is an example of an iSCSI qualified name from an 
equipment vendor: 

               Organizational  Subgroup Naming Authority
                 Naming        and/or string Defined by
    Type  Date    Auth         Org. or Local Naming Authority
    +--++-----+ +---------+ +--------------------+
    |  ||     | |         | |                    |

    iqn.2001-04.com.example:diskarrays-sn-a8675309

Where:

"iqn" specifies the use of the iSCSI qualified name as the
authority.

"2001-04" is the year and month on which the naming authority
acquired the domain name used in this iSCSI name. This is used to
ensure that when domain names are sold or transferred to another
organization, iSCSI names generated by these organizations will be
unique.

"com.example" is a reversed DNS name, and defines the
organizational naming authority. The owner of the DNS name
"example.com" has the sole right of use of this name as this part
of an iSCSI name, as well as the responsibility to keep the
remainder of the iSCSI name unique. In this case, example.com
happens to manufacture disk arrays.

"diskarrays" was picked arbitrarily by example.com to identify the 
disk arrays they manufacture. Another product that ACME makes 
might use a different name, and have its own namespace independent 
of the disk array group. The owner of "example.com" is 
responsible for keeping this structure unique. 

"sn" was picked by the disk array group of ACME to show that what 
follows is a serial number. They could have just assumed that all 
iSCSI Names are based on serial numbers, but they thought that 
perhaps later products might be better identified by something 
else. Adding "sn" was a future-proof measure. 

"a8675309" is the serial number of the disk array, uniquely 
identifying it from all other arrays. 
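
As a rough illustration of the breakdown above, the following Python sketch splits an iqn-format name into its date code, reversed naming authority, and optional organization-defined string. The function name `parse_iqn` and the dictionary keys are hypothetical, and the pattern is a simplification: it does not check that the month is valid or apply the full character rules of [iSCSI].

```python
import re

# Simplified pattern: "iqn." + YYYY-MM + "." + reversed domain + optional ":" + string.
IQN_PATTERN = re.compile(
    r"iqn\.(?P<date>\d{4}-\d{2})\.(?P<authority>[^:]+)(?::(?P<suffix>.+))?"
)


def parse_iqn(name):
    """Return the parts of an iqn-format iSCSI name as a dict."""
    match = IQN_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not an iqn-format name: {name!r}")
    authority = match.group("authority")
    return {
        "date": match.group("date"),
        "authority": authority,
        # Undo the reversal to recover the DNS name of the naming authority.
        "domain": ".".join(reversed(authority.split("."))),
        # Everything after the first ':' is defined by the naming authority.
        "suffix": match.group("suffix"),
    }
```

For the vendor example above, this sketch would report the date code `2001-04`, the authority `com.example` (domain `example.com`), and the string `diskarrays-sn-a8675309`.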

Another example shows how the ':' separator helps owners of sub-domains
to keep their name spaces unique:

                 Naming             Defined by
    Type  Date   Authority          Naming Authority
    +--++-----+ +-----------------+ +-----------+
    |  ||     | |                 | |           |

    iqn.2001-04.com.example.storage:tape.sys1.xyz

                 Naming                  Defined by
    Type  Date   Authority               Naming Authority
    +--++-----+ +----------------------+ +------+
    |  ||     | |                      | |      |

    iqn.2001-04.com.example.storage.tape:sys1.xyz

Note that, except for the ':' separator, both names are identical.
The first was assigned by the owner of the subdomain
"storage.example.com"; the second was assigned by the owner of
"tape.storage.example.com". These are both legal names, and are
unique.

The following is an example of a name that might be constructed by a
research organization:

                 Naming         Defined by  Defined by
    Type  Date   Authority      cs dept     User "oaks"
    +--++-----+ +------------+ +--------+ +-----------+
    |  ||     | |            | |        | |           |

    iqn.2000-02.edu.example.cs:users.oaks:proto.target4

In the above example, Professor Oaks of Example University is
building research prototypes of iSCSI targets. EU's computer science
department allows each user to use his or her user name as a naming
authority for this type of work, by attaching "users.<username>"
after the ':', and another ':', followed by a string of the user's
choosing (the user is responsible for making this part unique).
Professor Oaks chose to use "proto.target4" for this particular
target.

The following is an example of an iSCSI name string from a storage
service provider:

                 Organization    String
                 Naming          Defined by Org.
    Type  Date   Authority       Naming Authority
    +--++-----+ +-------------+ +----------------------+
    |  ||     | |             | |                      |

    iqn.1995-11.com.example.ssp:customers.4567.disks.107

In this case, a storage service provider (ssp.example.com) has 
decided to re-name the targets from the manufacturer, to provide the 
flexibility to move the customer's data to a different storage 
subsystem should the need arise. 

The SSP has configured the iSCSI Name on this particular target for 
one of its customers, and has determined that it made the most sense 
to track these targets by their Customer ID number and a disk number. 
This target was created for use by customer #4567, and is the 107th 
target configured for this customer. 

Note that when reversing these domain names, the first
component (after the "iqn.") will always be a top-level domain name,
which includes "com", "edu", "gov", "org", "net", "mil", or one of
the two-letter country codes. The use of anything else as the first
component of these names is not allowed. In particular, companies
generating these names must not eliminate their "com." from the
string.
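
The top-level-domain rule can be checked mechanically. The Python sketch below is an approximation under stated assumptions: the helper name `first_component_is_tld` is hypothetical, the generic list contains only the six names quoted above, and any two-letter alphabetic label is accepted without confirming that it is an assigned country code.

```python
# Generic top-level domains quoted in the paragraph above.
GENERIC_TLDS = {"com", "edu", "gov", "org", "net", "mil"}


def first_component_is_tld(iqn_name):
    """Check that the naming authority of an iqn name starts with a TLD."""
    if not iqn_name.startswith("iqn."):
        raise ValueError(f"not an iqn-format name: {iqn_name!r}")
    # Drop the part after the first ':' and split the rest on dots:
    # ['iqn', '<date>', '<first component>', ...]
    fields = iqn_name.split(":", 1)[0].split(".")
    if len(fields) < 3:
        return False
    first = fields[2].lower()
    # Assumption: any two-letter alphabetic label counts as a country code.
    return first in GENERIC_TLDS or (len(first) == 2 and first.isalpha())
```

A name such as `iqn.2001-04.example:diskarrays`, where the "com." has been dropped, would be rejected by this check.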

Again, these iSCSI names are NOT addresses. Even though they make
use of DNS domain names, they are used only to specify the naming 
authority. An iSCSI name contains no implications of the iSCSI 
target or initiator's location. The use of the domain name is only a 
method of re-using an already ubiquitous name space. 



1.2. Constructing iSCSI names using the eui. format

The iSCSI eui. naming format allows a naming authority to use IEEE
EUI-64 identifiers in constructing iSCSI names. The details of
constructing EUI-64 identifiers are specified by the IEEE
Registration Authority (see [EUI64]).

Example iSCSI name:

    Type  EUI-64 identifier (ASCII-encoded hexadecimal)
    +--++--------------+
    |  ||              |

    eui.02004567A425678D
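
A minimal Python sketch of reading such a name follows. It assumes, based on the example above, that the identifier is exactly 16 hexadecimal digits (64 bits); the function name `parse_eui_name` is hypothetical and the sketch does not verify that the identifier was actually assigned under IEEE rules.

```python
import re

# "eui." followed by 16 ASCII hexadecimal digits (a 64-bit identifier).
EUI_PATTERN = re.compile(r"eui\.([0-9A-Fa-f]{16})")


def parse_eui_name(name):
    """Return the 8-byte EUI-64 identifier carried in an eui-format name."""
    match = EUI_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not an eui-format name: {name!r}")
    return bytes.fromhex(match.group(1))
```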



2. iSCSI Alias 

The iSCSI alias is a UTF-8 text string that may be used as an 
additional descriptive name for an initiator and target. This may 
not be used to identify a target or initiator during login, and does 
not have to follow the uniqueness or other requirements of the iSCSI
name. The alias strings are communicated between the initiator and 
target at login, and can be displayed by a user interface on either 
end, helping the user tell at a glance whether the initiators and/or 
targets at the other end appear to be correct. The alias must NOT be 
used to identify, address, or authenticate initiators and targets. 

The alias is a variable length string, between 0 and 255 characters, 
and is terminated with at least one NULL (0x00) character, as defined 
in [iSCSI]. No other structure is imposed upon this string. 
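
The alias format can be sketched in Python as follows. Several points are assumptions made for illustration: the limit of 255 is applied to characters rather than encoded bytes, an alias that fails to decode is treated as absent (returning `None`), and the names `encode_alias` and `decode_alias` are hypothetical.

```python
MAX_ALIAS_LENGTH = 255  # upper bound stated in the text above


def encode_alias(alias):
    """Encode an alias as UTF-8 followed by one NULL (0x00) terminator."""
    if len(alias) > MAX_ALIAS_LENGTH:
        raise ValueError("alias longer than 255 characters")
    if "\x00" in alias:
        raise ValueError("alias must not contain an embedded NULL")
    return alias.encode("utf-8") + b"\x00"


def decode_alias(raw):
    """Decode a received alias; return None if it looks invalid."""
    text_bytes, terminator, _ = raw.partition(b"\x00")
    if not terminator:
        return None  # no NULL terminator found
    try:
        alias = text_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not valid UTF-8
    if len(alias) > MAX_ALIAS_LENGTH:
        return None
    return alias
```

Because the alias is informational only, returning `None` rather than raising keeps a bad alias from interfering with anything else in this sketch.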



2.1. Purpose of an Alias 

Initiators and targets are uniquely identified by an iSCSI Name. 
These identifiers may be assigned by a hardware or software 
manufacturer, a service provider, or even the customer. Although 
these identifiers are nominally human-readable, they are likely to
be assigned from a point of view different from that of the other
side of the connection. For instance, a target name for a disk array
may be built from the array's serial number, and some sort of
internal target ID. Although this would still be human-readable and
transcribable, it offers little assurance to someone at a user
interface who would like to see "at-a-glance" whether this target is
really the correct one.

The use of an alias helps solve that problem. An alias is simply a
descriptive name that can be assigned to an initiator or target, that is
independent of the name, and does not have to be unique. Since it is
not unique, the alias must be used in a purely informational way. It
may not be used to specify a target at login, or used during
authentication.

Both targets and initiators may have aliases. 

2.2. Target Alias

To show the utility of an alias, here is an example using an alias
for an iSCSI target.

Imagine sitting at a desktop station that is using some iSCSI devices 
over a network. The user requires another iSCSI disk, and calls the 
storage services person (internal or external), giving any 
authentication information that the storage device will require for 
the host. The services person allocates a new target for the host, 
and sends the Target Name for the new target, and probably an 
address, back to the user. The user then adds this Target Name to 
the configuration file on the host, and discovers the new device. 

Without an alias, a user managing an iSCSI host would click on some 
sort of management "show targets" button to show the targets to which 
the host is currently connected. 

+--Connected-To-These-Targets----------------------
|
|  Target Name
|
|  iqn.1995-04.com.example:sn.5551212.target.450
|  iqn.1995-04.com.example:sn.5551212.target.489
|  iqn.1995-04.com.example:sn.8675309
|  iqn.2001-04.com.example.storage:tape.sys1.xyz
|  iqn.2001-04.com.example.storage.tape:sys1.xyz
|
+--------------------------------------------------

In the above example, the user sees a collection of iSCSI Names, but 
with no real description of what they are for. They will, of course, 
map to a system-dependent device file or drive letter, but it's not 
easy looking at numbers quickly to see if everything is there. 

If a more intelligent target configures an alias for each target, the 
alias can provide a more descriptive name. This alias may be sent 
back to the initiator as part of the login response, or found in the 
iSCSI MIB. It then might be used in a display such as the following:

+--Connected-To-These-Targets----------------------
|
|  Alias        Target Name
|
|  Oracle 1     iqn.1995-04.com.example:sn.5551212.target.450
|  Local Disk   iqn.1995-04.com.example:sn.5551212.target.489
|  Exchange 2   iqn.1995-04.com.example:sn.8675309
|
+--------------------------------------------------



This would give the user a better idea of what's really there. 

In general, flexible, configured aliases will probably be supported 
by larger storage subsystems and configurable gateways. Simpler 
devices will likely not keep configuration data around for things 
such as an alias. The TargetAlias string could be either left 
unsupported (not given to the initiator during login) or could be 
returned as whatever the "next best thing" that the target has that 
might better describe it. Since it does not have to be unique, it 
could even return SCSI inquiry string data. 



Note that if a simple initiator does not wish to keep or display 
alias information, it can be simply ignored if seen in the login 

2.3. Initiator Alias

An initiator alias can be used in the same manner as a target alias. 
An initiator may send the alias in a login request, when it sends its 
iSCSI Initiator Name. The alias is not used for authentication, but 
may be kept with the session information for display through a 
management GUI or command-line interface (for a more complex 
subsystem or gateway), or through the iSCSI MIB. 

Note that a simple target can just ignore the Initiator Alias if it 
has no management interface on which to display it. 

Usually just the hostname would be sufficient for an initiator alias, 
but a custom alias could be configured for the sake of the service 
provider if needed. Even better would be a description of what the 
machine was used for, such as "Exchange Server 1", or "User Web 
Server". 

Here's an example of a management interface showing a list of 
sessions on an iSCSI target network entity. For this display, the 
targets are using an internal target number, which is a fictional 
field that has purely internal significance. 

+--Connected-To-These-Initiators-------------------
|
|  Target   Initiator Name
|
|  450      iqn.1995-04.com.example.sw:cd.12345678-OEM-456
|  451      iqn.1995-04.com.example.os:hostid.A598B45C
|  309      iqn.1995-04.com.example.sw:cd.87654321-OEM-259
|
+--------------------------------------------------

And with the initiator alias displayed:

+--Connected-To-These-Initiators-------------------
|
|  Target   Alias                Initiator Name
|
|  450      Web Server 4         iqn.1995-04.com.example.sw:cd.12...
|  451      scsigw.example.com   iqn.1995-04.com.example.os:hosti...
|  309      Exchange Server      iqn.1995-04.com.example.sw:cd.87...
|
+--------------------------------------------------

This gives the storage administrator a better idea of who is 
connected to their targets. Of course, one could always do a reverse 
DNS lookup of the incoming IP address to determine a host name, but 
simpler devices really don't do well with that particular feature due 
to blocking problems, and it won't always work if there is a firewall 
or iSCSI gateway involved. 

Again, these are purely informational and optional and require a 
management application. 

Aliases are extremely easy to implement. Targets just send a
TargetAlias whenever they send a TargetName. Initiators just send an
InitiatorAlias whenever they send an InitiatorName. If an alias is
received that does not fit, or seems invalid in any way, it is
ignored.



3. iSCSI Discovery 

The goal of iSCSI discovery is to allow an initiator to find the
targets to which it has access, and at least one address at which
each target may be accessed. This should generally be done using as
little configuration as possible. This section defines the discovery
mechanism only; no attempt is made to specify central management of
iSCSI devices within this document. Moreover, the iSCSI discovery
mechanisms listed here only deal with target discovery and one still
needs to use the SCSI protocol for LUN discovery.

In order for an iSCSI initiator to establish an iSCSI session with an
iSCSI target, the initiator needs the IP address, TCP port number
and iSCSI target name information. The goal of iSCSI discovery
mechanisms is to provide low overhead support for small iSCSI
setups, and scalable discovery solutions for large enterprise setups.
Thus, there are several methods that may be used to find targets,
ranging from configuring a list of targets and addresses on each
initiator and doing no discovery at all, to configuring nothing on
each initiator, and allowing the initiator to discover targets
dynamically. The various discovery mechanisms differ in their
assumptions about what information is already available to the
initiators and what information still needs to be discovered.

iSCSI supports the following discovery mechanisms:

a. Static Configuration: This mechanism assumes that the IP address, 
TCP port and the iSCSI target name information are already 
available to the initiator. The initiators need to perform no 
discovery in this approach. The initiator uses the IP address and 
the TCP port information to establish a TCP connection, and it 
uses the iSCSI target name information to establish an iSCSI 
session. This discovery option is convenient for small iSCSI
setups. 

b. SendTargets: This mechanism assumes that the target's IP address 
and TCP port information are already available to the initiator. 
The initiator then uses this information to establish a discovery 
session to the Network Entity. The initiator then subsequently 
issues the SendTargets text command to query information about the 
iSCSI targets available at the particular Network Entity (IP 
address). SendTargets command details can be found in the iSCSI 
draft [iSCSI]. This discovery option is convenient for iSCSI 
gateways and routers. 

c. Zero-Configuration: This mechanism assumes that the initiator does
not have any information about the target. In this option, the
initiator can either multicast discovery messages directly to the
targets or it can send discovery messages to storage name servers.
Currently, there are many general purpose discovery frameworks
available such as Salutation [John], Jini [John], UPnP [John],
SLP [RFC2608] and iSNS [iSNS]. However, with respect to iSCSI, SLP
can clearly perform the needed discovery functions [iSCSI-SLP],
while iSNS [iSNS] can be used to provide related management
functions including notification, access management,
configuration, and discovery management. iSCSI equipment that
needs discovery functions beyond SendTargets should at least
implement SLP, and then consider iSNS when extended discovery
management capabilities are required, such as in larger storage
networks. It should be noted that since iSNS will support SLP,
iSNS can be used to help manage the discovery information returned
by SLP.
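
The three mechanisms differ only in what the initiator already knows, which the following Python sketch makes explicit. It is a decision helper written for illustration, not part of any iSCSI implementation; the enum `Discovery` and the function `choose_discovery` are hypothetical names.

```python
from enum import Enum


class Discovery(Enum):
    STATIC = "static configuration"
    SEND_TARGETS = "SendTargets"
    ZERO_CONFIG = "zero-configuration"


def choose_discovery(address_known, target_name_known):
    """Pick the mechanism whose assumptions match what the initiator knows.

    address_known: the target's IP address and TCP port are configured.
    target_name_known: the iSCSI target name is configured.
    """
    if address_known and target_name_known:
        # Nothing left to discover: connect and log in directly.
        return Discovery.STATIC
    if address_known:
        # Open a discovery session and issue the SendTargets text command.
        return Discovery.SEND_TARGETS
    # No information about the target: multicast or ask a name server.
    return Discovery.ZERO_CONFIG
```

The case where only the target name is known falls through to zero-configuration in this sketch, since the text does not describe a mechanism that starts from a name alone.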
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Most security issues relating to iSCSI naming are discussed in the 
main iSCSI draft [iSCSI] and the iSCSI security draft [IPS-SEC]. 

In addition, Appendix B discusses naming and discovery issues when
gateways, proxies, and firewalls are used to solve security or 
discovery issues in some situations where iSCSI is deployed. 

iSCSI allows several different authentication methods to be used.
For many of these methods, an authentication identifier is used, 
which may be different from the iSCSI node name of the entity being 
authenticated. This is discussed in more detail in Appendix C. 



5. References 



[SAM2]      R. Weber et al, INCITS T10 Project 1157-D revision 24, "SCSI
            Architectural Model - 2 (SAM-2)", Section 4.7.6 "SCSI device
            name", September 2002.

[John]      R. John, "UPnP, Jini and Salutation - A look at some popular
            coordination frameworks for future networked devices",
            http://www.cswl.com/whiteppr/tech/upnp.html, June 17, 1999.

[RFC2979]   N. Freed, "Behavior of and Requirements for Internet
            Firewalls", RFC 2979, October 2000.

[RFC3303]   P. Srisuresh et al, "Middlebox Communication Architecture
            and Framework", RFC 3303, August 2002.

[iSCSI]     J. Satran et al, "iSCSI", Work in Progress,
            draft-ietf-ips-iscsi-20.txt, January 2003.

[iSNS]      J. Tseng et al, "Internet Storage Name Service (iSNS)", Work
            in Progress, draft-ietf-ips-isns-17.txt, January 2003.

[RFC1737]   K. Sollins, L. Masinter, "Functional Requirements for
            Uniform Resource Names", RFC 1737, December 1994.



Voruganti, et. al.        Expires September 2003        [Page 13]

Internet Draft       iSCSI Naming and Discovery       March 2003



[RFC1035]   P. Mockapetris, "Domain Names - Implementation and
            Specification", RFC 1035, November 1987.

[EUI64]     EUI - "Guidelines for 64-bit Global Identifier (EUI-64)
            Registration Authority",
            http://standards.ieee.org/regauth/oui/tutorials/EUI64.html

[RFC2396]   T. Berners-Lee, R. Fielding, L. Masinter, "Uniform Resource
            Identifiers", RFC 2396, August 1998.

[RFC2276]   K. Sollins, "Architectural Principles of URN Resolution",
            RFC 2276, January 1998.

[RFC2483]   M. Mealling, R. Daniel, Jr., "URI Resolution Services", RFC
            2483, January 1999.

[RFC2141]   R. Moats, "URN Syntax", RFC 2141, May 1997.

[RFC2611]   L. Daigle et al, "URN Namespace Definition Mechanisms", RFC
            2611, June 1999.

[RFC2608]   E. Guttman et al, "SLP Version 2", RFC 2608, June 1999.

[RFC2610]   C. Perkins, E. Guttman, "DHCP Options for the Service
            Location Protocol", RFC 2610, June 1999.

[RFC2373]   R. Hinden, S. Deering, "IP Version 6 Addressing
            Architecture", RFC 2373, July 1998.

[RFC2732]   R. Hinden, B. Carpenter, L. Masinter, "Format for Literal
            IPv6 Addresses in URLs", RFC 2732, December 1999.

[iSCSI-SLP] M. Bakke et al, "Finding iSCSI Targets and Name Servers
            using SLP", Work in Progress,
            draft-ietf-ips-iscsi-slp-05.txt, March 2003.

[IPS-SEC]   B. Aboba et al, "Securing Block Storage Protocols over IP",
            Work in Progress, draft-ietf-ips-security-19.txt, January
            2003.






6. Authors' Addresses 

Address comments to: 

Kaladhar Voruganti
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120
Email: kaladhar@us.ibm.com

Mark Bakke
Cisco Systems, Inc.
6450 Wedgwood Road
Maple Grove, MN 55311
Phone: +1 763 398-1054
Email: mbakke@cisco.com

Jim Hafner
IBM Almaden Research Center
650 Harry Road
San Jose, CA 95120
Phone: +1 408 927-1892
Email: hafner@almaden.ibm.com

John L. Hufferd
IBM Storage Systems Group
5600 Cottle Road
San Jose, CA 95193
Phone: +1 408 256-0403
Email: hufferd@us.ibm.com

Marjorie Krueger
Hewlett-Packard Corporation
8000 Foothills Blvd
Roseville, CA 95747-5668, USA
Phone: +1 916 785-2656
Email: marjorie_krueger@hp.com




Appendix A: iSCSI Naming Notes 

Some iSCSI Name Examples for Targets 

- Assign to a target based on controller serial number

iqn.2001-04.com.example:diskarray.sn.8675309

- Assign to a target based on serial number

iqn.2001-04.com.example:diskarray.sn.8675309.oracle_db_1

Where oracle_db_1 might be a target label assigned by a user. 

This would be useful for a controller that can present different 
logical targets to different hosts. 

Obviously, any naming authority may come up with its own scheme and 
hierarchy for these names, and be just as valid. 

A target iSCSI Name should never be assigned based on interface 
hardware, or other hardware that can be swapped and moved to other 
devices. 
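
The target name examples above can be assembled mechanically. The sketch below is only an illustration: the helper name `make_iqn` is hypothetical, and the 223-byte limit on the name is an assumption taken from the iSCSI specification, not from this appendix.

```python
import re

MAX_NAME_BYTES = 223  # assumed iSCSI Name size limit; see the main iSCSI draft
_DATE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")  # yyyy-mm


def make_iqn(date, domain, *labels):
    """Build an iqn-format name from a date, a naming-authority domain, and labels.

    The domain is given in normal order (example.com) and is reversed
    (com.example), as in the examples in this appendix.
    """
    if not _DATE.match(date):
        raise ValueError("date must be yyyy-mm")
    reversed_domain = ".".join(reversed(domain.lower().split(".")))
    name = "iqn.{}.{}".format(date, reversed_domain)
    if labels:
        # Everything after the colon is chosen by the naming authority.
        name += ":" + ".".join(labels)
    if len(name.encode("utf-8")) > MAX_NAME_BYTES:
        raise ValueError("iSCSI Name too long")
    return name
```

For example, `make_iqn("2001-04", "example.com", "diskarray", "sn", "8675309")` reproduces the first target name shown above. Note that the labels come from the controller serial number and a user-assigned label, not from interface hardware.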

Some iSCSI Name Examples for Initiators 

- Assign to the OS image by fully qualified host name 

iqn.2001-04.com.example.os:dns.com.customer1.host-four

Note the use of two FQDNs - that of the naming authority and also
that of the host that is being named. This can cause problems, due
to limitations imposed on the size of the iSCSI Name.

- Assign to the OS image by OS install serial number 

iqn.2001-04.com.example.os:newos5.12345-OEM-0067890-23456

Note that this breaks if an install CD is used more than once.
Depending on the O/S vendor's philosophy, this might be a feature.

- Assign to the RAID Array by a service provider

iqn.2001-04.com.example.myssp:users.mbakke05657

Appendix B: Interaction with Proxies and Firewalls

iSCSI has been designed to allow SCSI initiators and targets to 
communicate over an arbitrary network. This, making some assumptions 
about authentication and security, means that in theory, the whole 
internet could be used as one giant storage network. 

However, there are many access and scaling problems that would come 
up when this is attempted. 

1. Most iSCSI targets may only be meant to be accessed by one or a few
initiators. Discovering everything would be unnecessary.

2. The initiator and target may be owned by separate entities, each
with their own directory services, authentication, and other schemes.
An iSCSI-aware proxy may be required to map between these things.

3. Many environments use non-routable IP addresses, such as the "10."
network.

For these and other reasons, various types of firewalls [RFC2979] and 
proxies will be deployed for iSCSI, similar in nature to those 
already handling protocols such as HTTP and FTP. 

B.1. Port Redirector

A port redirector is a stateless device that is not aware of iSCSI.
It is used to do Network Address Translation (NAT), which can map IP
addresses between routable and non-routable domains, as well as map 
TCP ports. While devices providing these capabilities can often 
filter based on IP addresses and TCP ports, they generally do not 
provide meaningful security, and are used instead to resolve internal 
network routing issues. 

Since it is entirely possible that these devices are used as routers 
and/or aggregators between a firewall and an iSCSI initiator or 
target, iSCSI connections must be operable through them. 

Effects on iSCSI:

- iSCSI-level data integrity checks must not include information from
the TCP or IP headers, as these may be changed in between the
initiator and target.

- iSCSI messages that specify a particular initiator or target, such
as login requests and third party requests, should specify the
initiator or target in a location-independent manner. This is
accomplished using the iSCSI Name.

- When an iSCSI discovery connection is to be used through a port
redirector, a target will have to be configured to return a domain
name instead of an IP address in a SendTargets response, since the
port redirector will not be able to map the IP address(es) returned
in the iSCSI message. It is a good practice to do this anyway.

B.2. SOCKS server



A SOCKS server can be used to map TCP connections from one network 
domain to another. It is aware of the state of each TCP connection. 

The SOCKS server provides authenticated firewall traversal for 
applications that are not firewall-aware. Conceptually, SOCKS is a
"shim-layer" that exists between the application (i.e., iSCSI) and
TCP.


To use SOCKS, the iSCSI initiator must be modified to use the 
encapsulation routines in the SOCKS library. The initiator then opens
up a TCP connection to the SOCKS server, typically on the canonical 
SOCKS port 1080. A sub-negotiation then occurs, during which the 
initiator is either authenticated or denied the connection request. 
If authenticated, the SOCKS server then opens a TCP connection to the 
iSCSI target using addressing information sent to it by the initiator 
in the SOCKS shim. The SOCKS server then forwards iSCSI commands, 
data, and responses between the iSCSI initiator and target. 

Use of the SOCKS server requires special modifications to the iSCSI 
initiator. No modifications are required to the iSCSI target. 

As a SOCKS server can map most of the addresses and information
contained within the IP and TCP headers, including sequence numbers,
its effects on iSCSI are identical to those in the port redirector.

B.3. SCSI gateway

This gateway presents logical targets (iSCSI Names) to the 
initiators, and maps them to SCSI targets as it chooses. The 
initiator sees this gateway as a real iSCSI target, and is unaware of 
any proxy or gateway behavior. The gateway may manufacture its own 
iSCSI Names, or map the iSCSI names using information provided by the 
physical SCSI devices. It is the responsibility of the gateway to 
ensure the uniqueness of any iSCSI name it manufactures. The gateway 
may have to account for multiple gateways having access to a single 
physical device. This type of gateway is used to present parallel 
SCSI, Fibre Channel, SSA, or other devices as iSCSI devices. 

Effects on iSCSI:

- Since the initiator is unaware of any addresses beyond the gateway,
the gateway's own address is for all practical purposes the real
address of a target. Only the iSCSI Name needs to be passed. This
is already done in iSCSI, so there are no further requirements to
support SCSI gateways.

B.4. iSCSI Proxy

An iSCSI proxy is a gateway that terminates the iSCSI protocol on 
both sides, rather than translate between iSCSI and some other 
transport. The proxy functionality is aware that both sides are 
iSCSI, and can take advantage of optimizations, such as the 
preservation of data integrity checks. Since an iSCSI initiator's 
discovery or configuration of a set of targets makes use of address- 
independent iSCSI names, iSCSI does not have the same proxy 
addressing problems as HTTP, which includes address information in
its URLs. If a proxy is to provide services to an initiator on 
behalf of a target, the proxy allows the initiator to discover its 
address for the target, and the actual target device is discovered 
only by the proxy. Neither the initiator nor the iSCSI protocol 
needs to be aware of the existence of the proxy. Note that a SCSI 
gateway may also provide iSCSI proxy functionality when mapping 
targets between two iSCSI interfaces. 

Effects on iSCSI:

- Same as a SCSI gateway. The only other effect is that iSCSI must 
separate data integrity checking on iSCSI headers and iSCSI data, 
to allow the data integrity check on the data to be propagated end- 
to-end through the proxy. 

B.5. Stateful Inspection Firewall (stealth iSCSI firewall)

The stealth model would exist as an iSCSI-aware firewall that is
invisible to the initiator, but provides capabilities found in the
iSCSI proxy.

Effects on iSCSI:

- Since this is invisible, there are no additional requirements on 
the iSCSI protocol for this one. 

This one is more difficult in some ways to implement, simply because 
it has to be part of a standard firewall product, rather than part of 
an iSCSI-type product. 

Also note that this type of firewall is only effective in the 
outbound direction (allowing an initiator behind the firewall to
connect to an outside target), unless the iSCSI target is located in
a DMZ (De-Militarized Zone) [RFC3303]. It does not provide adequate 
security otherwise. 



Appendix C: iSCSI Names and Security Identifiers 

This document has described the creation and use of iSCSI Node Names.
There will be trusted environments where this is a sufficient form of 
identification. In these environments the iSCSI Target may have an 
Access Control List (ACL), which will contain a list of authorized 
entities that are permitted to access a restricted resource (in this 
case a Target Storage Controller). The iSCSI Target will then use 
that ACL to permit (or not) certain iSCSI Initiators to access the 
storage at the iSCSI Target Node. This form of ACL is used to prevent 
trusted initiators from making a mistake and connecting to the wrong 
storage controller.

It is also possible that the ACL and the iSCSI Initiator Node Name 
can be used in conjunction with the SCSI layer for the appropriate 
SCSI association of LUNs with the Initiator. The SCSI layer's use of 
the ACL will not be discussed further in this document. 

There will be situations where the iSCSI Nodes exist in untrusted 
environments. That is, some iSCSI Initiator Nodes may be authorized
to access an iSCSI Target Node; however, because of the untrusted
environment, nodes on the network cannot be trusted to give the 
correct iSCSI Initiator Node Names. 

In untrusted environments an additional type of identification is
required to assure the target that it really knows the identity of 
the requesting entity. 

The authentication and authorization in the iSCSI layer is
independent of anything that IPSec might handle, underneath or around
the TCP layer. This means that the initiator node needs to pass some
type of security related identification information (e.g. user id) to
a security authentication process such as SRP, CHAP, Kerberos etc.
(These authentication processes will not be discussed in this
document.)

Upon the completion of the iSCSI security authentication, the 
installation knows "who" sent the request for access. The 
installation must then check to ensure that such a request, from the 
identified entity, is permitted/authorized. This form of 
Authorization is generally accomplished via an Access Control List 
(ACL) as described above. Using this authorization process, the 
iSCSI target will know that the entity is authorized to access the
iSCSI Target Node.

It may be possible for an installation to set a rule that the 
security identification information (e.g. UserID) be equal to the
iSCSI Initiator Node Name. In that case, the ACL approach described 
above should be all the authorization that is needed. 

If, however, the iSCSI Initiator Node Name is not used as the 
security identifier there is a need for more elaborate ACL 
functionality. This means that the target requires a mechanism to map 
the security identifier (e.g. UserID) information to the iSCSI
Initiator Node Name. That is, the target must be sure that the
entity requesting access is authorized to use the name, which was
specified with the Login Keyword "InitiatorName=". For example, if
security identifier 'Frank' is authorized to access the target via
iSCSI InitiatorName=xxxx, but 'Frank' tries to access the target via
iSCSI InitiatorName=yyyy, then this login should be rejected.

On the other hand, it is possible that 'Frank' is a roaming user (or 
a Storage Administrator) that "owns" several different systems, and 
thus, could be authorized to access the target via multiple different 
iSCSI initiators. In this case, the ACL needs to have the names of 
all the initiators through which 'Frank' can access the target. 
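
The mapping between security identifiers and initiator names described above can be sketched as a simple lookup. This is a minimal illustration only: the table layout, the function name, and the example initiator names are hypothetical, and a real target would populate the ACL from its own configuration.

```python
# Hypothetical ACL: authenticated security identifier -> the iSCSI Initiator
# Node Names that identifier is allowed to present in InitiatorName=.
ACL = {
    "Frank": {
        "iqn.2001-04.com.example.os:dns.com.customer1.host-one",
        "iqn.2001-04.com.example.os:dns.com.customer1.host-four",
    },
}


def login_authorized(acl, security_id, initiator_name):
    """Return True only if the authenticated identity may use this name.

    Authentication (SRP, CHAP, Kerberos, ...) is assumed to have already
    established security_id; this step is authorization only.
    """
    return initiator_name in acl.get(security_id, frozenset())
```

A roaming user such as 'Frank' simply has several initiator names in his entry; a login that presents a name outside that set, or an identifier with no entry at all, is rejected.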

There may be other more elaborate ACL approaches, which can also be 
deployed to provide the installation/user with even more security 
with flexibility.

The above discussion is trying to inform the reader that, not only is 
there a need for access control dealing with iSCSI Initiator Node 
Names, but in certain iSCSI environments there might also be a need 
for other complementary security identifiers.



Abstract

By providing direct data transfer between storage and client, net- 
work-attached storage devices have the potential to improve scal- 
ability for existing distributed file systems (by removing the server 
as a bottleneck) and bandwidth for new parallel and distributed file 
systems (through network striping and more efficient data paths). 
Together, these advantages influence a large enough fraction of the 
storage market to make commodity network-attached storage feasible.
Realizing the technology's full potential requires careful
consideration across a wide range of file system, networking and 
security issues. This paper contrasts two network-attached storage 
architectures — (1) Networked SCSI disks (NetSCSI) are network- 
attached storage devices with minimal changes from the familiar 
SCSI interface, while (2) Network-Attached Secure Disks (NASD)
are drives that support independent client access to drive object 
services. To estimate the potential performance benefits of these 
architectures, we develop an analytic model and perform trace- 
driven replay experiments based on AFS and NFS traces. Our 
results suggest that NetSCSI can reduce file server load during a 
burst of NFS or AFS activity by about 30%. With the NASD archi- 
tecture, server load (during burst activity) can be reduced by a fac- 
tor of up to five for AFS and up to ten for NFS. 

1 Introduction 

Users are increasingly using distributed file systems to access 
data across local area networks; personal computers with hundred- 
plus MIPS processors are becoming increasingly affordable; and 
the sustained bandwidth of magnetic disk storage is expected to 
exceed 30 MB/s by the end of the decade. These trends place a 
pressing need on distributed file system architectures to provide
clients with efficient, scalable, high-bandwidth access to stored
data. This paper discusses a powerful approach to fulfilling this
need. Network-attached storage provides high bandwidth by
directly attaching storage to the network, avoiding file server 
store-and-forward operations and allowing data transfers to be 
striped over storage and switched-network links. 

The principal contribution of this paper is to demonstrate the 
potential of network-attached storage devices for penetrating the 
markets defined by existing distributed file system clients, specifically
the Network File System (NFS) and Andrew File System
(AFS) distributed file system protocols. Our results suggest that 
network-attached storage devices can improve overall distributed 
file system cost-effectiveness by offloading disk access, storage 
management and network transfer and greatly reducing the amount 
of server work per byte accessed. 

We begin by charting the range of network-attached storage 
devices that enable scalable, high-bandwidth storage systems. Spe- 
cifically, we present a taxonomy of network-attached storage — 
server-attached disks (SAD), networked SCSI (NetSCSI) and net- 
work-attached secure disks (NASD) — and discuss the distributed 
file system functions offloaded to storage and the security models 
supportable by each. 

With this taxonomy in place, we examine traces of requests 
on NFS and AFS file servers, measure the operation costs of com- 
monly used SAD implementations of these file servers and 
develop a simple model of the change in manager costs for NFS 
and AFS in NetSCSI and NASD environments. Evaluating the 
impact on file server load analytically and in trace-driven replay 
experiments, we find that NASD promises much more efficient 
file server offloading in comparison to the simpler NetSCSI. With 
this potential benefit for existing distributed file server markets, 
we conclude that it is worthwhile to engage in detailed NASD 
implementation studies to demonstrate the efficiency, throughput 
and response time of distributed file systems using network- 
attached storage devices. 

In Section 2, we discuss related work. Section 3 presents our 
taxonomy of network-attached storage architectures. In Section 4, 
we describe the NFS and AFS traces used in our analysis and 
replay experiments and report our measurements of the cost of 
each server operation in CPU cycles. Section 5 develops an ana- 
lytic model to estimate the potential scaling offered by server-off- 
loading in NetSCSI and NASD based on the collected traces and 
the measured costs of server operations. The trace-driven replay 
experiment and the results are the subject of Section 6. Finally, 
Section 7 presents our conclusions and discusses future directions.

2 Related Work

Distributed file systems provide remote access to shared file 
storage in a networked environment [Sandberg85, Howard88, 
Minshall94]. A principal measure of a distributed file system's
cost is the computational power required from the servers to provide
adequate performance for each client's work [Howard88,
Nelson88]. While microprocessor performance is increasing dramatically
and raw computational power would not normally be a
concern, the work done by a file server is data- and interrupt-intensive
and, with the poorer locality typical of operating systems,
faster microprocessors will provide much less benefit than their
cycle time trends promise [Oust^ Chen93].

Typically, distributed file systems employ client caching to 
reduce this server load. For example, AFS clients use local disk to 
cache a subset of the global system's files. While client caching is 
essential for high performance, increasing file sizes, computation 
sizes, and workgroup sharing are all inducing more misses per 
cache block [Ousterhout85, Baker91]. At the same time, increased 
client cache sizes are making these misses more bursty. 

When the post-client-cache server load is still too large, it can 
either be distributed over multiple servers or satisfied by a custom-designed
high-end file server. Multiple-server distributed file systems
attempt to balance load by partitioning the namespace and
replicating static, commonly used files. This replication and parti- 
tioning is too often ad-hoc, leading to the "hotspot" problem famil- 
iar in multiple-disk mainframe systems [Kim86] and requiring 
frequent user-directed load balancing. Not surprisingly, custom- 
designed high-end file servers more reliably provide good perfor- 
mance, but can be an expensive solution [Hitz90, Drapeau94]. 

Experience with disk arrays suggests another solution. If data 
is striped over multiple independent disks of an array, then a high- 
concurrency workload will be balanced with high probability as 
long as individual accesses are small relative to the unit of interleaving
[Livny87, Patterson88, Chen90]. Similarly, striping file
storage across multiple servers provides parallel transfer of large
files and balancing of high concurrency workloads [Hartman93];
striping of metadata promises further load-balancing [Dahlin95].

Scalability prohibits the use of a single shared-media net- 
work; however, with the emergence of switched network fabrics 
based on high-speed point-to-point links, striped storage can scale 
bandwidth independent of other traffic in the same fabric 
[Arnould89, Siu95, Boden95]. Unfortunately, current implementations
of Internet protocols demand significant processing power to
deliver high bandwidth: we observe as much as 80% of a 233
MHz DEC Alpha consumed by UDP/IP receiving 135 Mbps over
155 Mbps ATM (even with adaptor support for packet reassembly).
Improving this bandwidth depends on interface board designs
[Steenkiste94, Cooper90], integrated layer processing for network
protocols [Clark89], direct application access to the network interface
[vonEiken92, Maeda93], copy avoiding buffering schemes
[Druschel93, Brustoloni96], and routing support for high-performance
best-effort traffic [Ma96, Traw95]. Perhaps most importantly,
the protocol stacks resulting from these research efforts
must be deployed widely. This deployment is critical because the 
comparable storage protocols, SCSI, and soon, Fibre Channel, pro- 
vide cost-effective hardware implementations routinely included 
in client machines. For comparison, a 175 MHz DEC Alpha con- 
sumes less than 5% of its processing power fetching 100 Mbps 
from a 160 Mbps SCSI channel via the UNIX raw disk interface. 
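
To make the comparison above concrete, the two measurements can be converted to processor cycles spent per byte moved. The sketch below is only back-of-the-envelope arithmetic on the figures quoted in this paragraph; it assumes 1 Mbps = 10^6 bits per second and treats "less than 5%" as exactly 5%, so the SCSI figure is an upper bound. The function name is hypothetical.

```python
def cycles_per_byte(clock_hz, cpu_fraction, throughput_bps):
    """CPU cycles consumed per byte transferred."""
    bytes_per_second = throughput_bps / 8
    return clock_hz * cpu_fraction / bytes_per_second


# UDP/IP over ATM: 80% of a 233 MHz Alpha while receiving 135 Mbps.
network = cycles_per_byte(233e6, 0.80, 135e6)

# Raw SCSI disk interface: under 5% of a 175 MHz Alpha while fetching 100 Mbps.
scsi = cycles_per_byte(175e6, 0.05, 100e6)

print("network: %.1f cycles/byte" % network)
print("scsi:    %.1f cycles/byte (upper bound)" % scsi)
print("ratio:   at least %.0fx" % (network / scsi))
```

Under these assumptions the network path costs roughly 11 cycles per byte against at most 0.7 for SCSI, a gap of more than an order of magnitude, which is the motivation for wide deployment of more efficient protocol stacks.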



To exploit the economics of large systems resulting from the
cobbling together of many client purchases, the xFS file system
distributes code, metadata and data over all clients, eliminating the
need for a centralized storage system [Dahlin95]. This scheme naturally
matches increasing client performance with increasing
server performance. Instead of reducing the server workload, however,
it takes the required computational power from another, frequently
idle, client. Complementing the advantages of filesystems
such as xFS, the network-attached storage architectures presented
in this paper significantly reduce the demand for server computation
and eliminate file server machines from the storage data path,
reducing the coupling between overall file system integrity and the
security of individual client machines.

As distributed file system technology has improved, so have
the storage technologies employed by these systems. Storage density
increases, long a predictable 25% per year, have risen to 60%
increases per year during the 90s. Data rates, which were constrained
by storage interface definitions until the mid-80s, have
increased by about 40% per year in the 90s [Grochowski96]. The
acceptance, in all but the lowest cost market, of SCSI, whose interface
exports the abstraction of a linear array of fixed-size blocks
provided by an embedded controller [ANSI86], catalyzed rapid
deployment of technology advances, resulting in an extremely
competitive storage market.

The level of indirection introduced by SCSI has also led to
transparent improvements in storage performance such as RAID;
transparent failure recovery; real-time geometry-sensitive scheduling;
buffer caching; read-ahead and write-behind; compression;
dynamic mapping; and representation migration [Patterson88,
Gibson92, Massiglia94, StorageTek94, Wilkes95, Ruemmler91,
Varma95]. However, in order to overcome the speed, addressability
and connectivity limitations of current SCSI implementations
[Sachs94, ANSI95], the industry is turning to high-speed packetized
interconnects such as Fibre Channel at up to 1 Gbps
[Benner96]. The disk drive industry anticipates the marginal cost
for on-disk Fibre Channel interfaces, relative to the common single-ended
SCSI interface in use today, to be comparable to the
marginal cost for high-performance differential SCSI (a difference
similar to the cost of today's Ethernet adapters) while their host
adapter costs are expected to be comparable to high-performance
SCSI adapters [Anderson95].

The idea of simple, disk-like network-based storage servers
whose functions are employed by higher-level distributed file systems
has been around for a long time [Birrell80, Katz92]. The
Mass Storage System Reference Model (MSSRM), an early architecture
for hierarchical storage subsystems, has advocated the separation
of control and data paths for almost a decade [Miller88,
IEEE94]. Using a high-bandwidth network that supports direct
transfers for the data path is a natural consequence [Kronenberg86,
Drapeau94, Long94, Lee95, Menasce96, VanMeter96]. The
MSSRM has been implemented in the High Performance Storage
System (HPSS) [Watson95] and augmented with socket-level
striping of file transfers [Berdahl95, Wiltzius95] over the multiple
network interfaces found on mainframes and supercomputers. 1



1 Following Van Meter's [VanMeter96] definition of network-attached
peripherals, we consider only networks that are shared with general local
area network traffic and not single-vendor systems whose interconnects are
fast, isolated local area networks [Horst95, IEEE92].

Figure 1: Server-attached disks (SAD) are the familiar local area network distributed file
systems. A client wanting data from storage sends a message to the file server (1), which sends a
message to storage (2), which accesses the data and sends it back to the file server (3), which
finally sends the requested data back to the client (4). Server-integrated disk (SID) is logically the
same except that hardware and software in the file server machine may be specialized to the file
service function.



Striping data across multiple storage servers with independent
ports into a scalable local area network has been advocated as
a means of obtaining scalable storage bandwidth [Hartman93]. If
the storage servers of this architecture are network-attached
devices, rather than dedicated machines between the network and
storage, efficiency is further improved by avoiding store-and-forward
delays through the server.

Our notion of network-attached storage is consistent with 
these projects. However, our analysis focuses on the evolution of 
commodity storage devices rather than niche-market, very high- 
end systems, and on the interaction of network-attached storage 
with common distributed file systems. Because all prior work 
views the network-based storage as a function provided by an 
additional computer, instead of the storage device itself, cost-effectiveness
has never been within reach. Our goal is to chart the
way network-attached storage is likely to appear in storage prod- 
ucts, estimate its scalability implications, and characterize the 
security and file system design issues in its implementation. 

3 Taxonomy of Network-Attached Storage

Simply attaching storage to a network underspecifies net- 
work-attached storage's role in distributed file systems' architec- 
tures. In the following subsections, we present a taxonomy for the 
functional composition of network-attached storage. Case 0, the 
base case, is the familiar local area network with storage privately 
connected to file server machines — we call this server-attached 
disks. Case 1 represents a wide variety of current products, server- 
integrated disks, that specialize hardware and software into an 
integrated file server product. In Case 2, the obvious network- 
attached disk design, network SCSI, minimizes modifications to 
the drive command interface, hardware and software. Finally, 
Case 3, network-attached secure disks, leverages the rapidly 
increasing processor capability of disk-embedded controllers to 
restructure the drive command interface. 

3.1 Case 0: Server-Attached Disks (SAD) 

This is the system familiar to office and campus local area 
networks as illustrated in Figure 1. Clients and servers share a network
and storage is attached directly to general-purpose workstations
that provide distributed file services.



3.2 Case 1: Server Integrated Disks (SID) 

Since file server machines often do little other than service
distributed file system requests, it makes sense to construct spe- 
cialized systems that perform only file system functions and not 
general-purpose computation. This architecture is not fundamen- 
tally different from SAD. Data must still move through the server 
machine before it reaches the network, but specialized servers can 
move this data more efficiently than general-purpose machines. 
Since high performance distributed file service benefits the pro- 
ductivity of most users, this architecture occupies an important 
market niche [Hitz90, Hitz94]. However, this approach binds stor- 
age to a particular distributed file system, its semantics, and its 
performance characteristics. For example, most server-integrated 
disks provide NFS file service, whose inherent performance has 
long been criticized [Howard88]. Furthermore, this approach is 
undesirable because it does not enable distributed file system and 
storage technology to evolve independently. Server striping, for 
instance, is not easily supported by any of the currently popular 
distributed file systems. Binding the storage interface to a particu-
lar distributed file system hampers the integration of such new fea-
tures [Birrell80].

3.3 Case 2: Network SCSI (NetSCSI)

The other end of the spectrum is to retain as much as possible 
of SCSI, the current dominant mid- and high-level storage device 
protocol. This is the natural evolution path for storage devices; 
Seagate's Barracuda FC is already providing packetized SCSI 
through Fibre Channel network ports to directly attached hosts
[Seagate96]. NetSCSI is a network-attached storage architecture 
that makes minimal changes to the hardware and software of SCSI 
disks. [...] similar
to the support for third-party transfers already supported by
SCSI [Drapeau94]. The efficient data transfer engines typical of 
fast drives ensure that the drive's sustained bandwidth is available 
to clients. Further, by eliminating the file manager from the data 
path, its workload per active client decreases. However, the use of 
third-party transfer changes the drive's role in the overall security 
of a distributed file system. While it is not unusual for distributed 
file systems to employ a security protocol between clients and
servers (e.g., Kerberos authentication), disk drives do not yet par-
ticipate in this protocol.

We identify four levels of security within the NetSCSI model:
(1) accident-avoidance with a second private network between file
manager and disk, both locked in a physically secure room; (2)
data transfer authentication with clients and drives equipped with a
strong cryptographic hash function; (3) data transfer privacy with
both clients and drives using encryption; and (4) secure key man-
agement with a secure coprocessor.

Figure 2 shows the simplest security enhancement to
NetSCSI: a second network port on each disk. Since SCSI disks
execute every command they receive without an explicit authori-
zation check, without a second port even well-meaning clients can
generate erroneous commands and accidentally damage parts of
the file system. The drive's second network port provides protec-
tion from accidents while allowing SCSI command interpreters to
continue following their normal execution model. This is the
architecture employed in the SIOF and HPSS projects at LLNL
[Wiltzius95, Watson95]. Assuming that file manager and NetSCSI
disks are locked in a secure room, this mechanism is acceptable for
the trusted network security model of NFS [Sandberg85].

Because file data still travels over the potentially hostile gen-
eral network, NetSCSI disks are likely to demand greater security
than simple accident avoidance. Cryptographic protocols can
strengthen the security of NetSCSI. A strong cryptographic hash
function, such as SHA [NIST94], computed at the drive and at the
client, would allow data transfer authentication (i.e., the correct
data was received only if the sender and receiver compute the
same hash on the data).
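
As a minimal sketch of the data transfer authentication described above, assume the sender transmits a digest alongside the data and the receiver recomputes it. The function names are illustrative only, and SHA-1 from Python's hashlib stands in for the SHA function cited in the text. A bare hash as shown detects corruption; against a hostile network the hash would presumably also need to be keyed, which this sketch does not do.

```python
import hashlib
import hmac

def digest(data: bytes) -> bytes:
    """Hash computed independently at the drive and at the client."""
    return hashlib.sha1(data).digest()

def transfer_ok(received: bytes, sender_digest: bytes) -> bool:
    """Accept the transfer only if both sides computed the same hash."""
    # compare_digest performs a constant-time comparison of the two digests
    return hmac.compare_digest(digest(received), sender_digest)

block = b"file data sent by the drive"      # hypothetical payload
tag = digest(block)                         # computed at the sender
print(transfer_ok(block, tag))              # unmodified transfer
print(transfer_ok(block + b"!", tag))       # data altered in transit
```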

For some applications, data transfer authentication is insuffi-
cient [...] NetSCSI must be able to encrypt and decrypt data
[...] over the untrusted network. However, since
keys will be stored in devices vulnerable to physical attack, the
servers must still be stored in physically secure environments. If
we go one step further and equip NetSCSI disks with secure copro-
cessors [Tygar95], then keys can be protected and all data can be
encrypted when outside the secure coprocessor, allowing the disks
to be used in a variety of physically open environments. There are
now a variety of secure coprocessors [NIST94a, Weingart87,
White87, National96] available, some of which promise crypto-
graphic accelerators sufficient to support single-disk bandwidth.

3.4 Case 3: Network-attached Secure Disks (NASD) 

With network-attached secure disks, we relax the constraint
of minimal change from the existing SCSI interface and imple-
mentation. Instead, we focus on selecting a command interface that
reduces the number of client-storage interactions that must be
relayed through the file manager, offloading more of the file
manager's work without integrating file system policy into the disk.
[...] such as reads and writes [...]. As opposed to NetSCSI, where a
significant part of the processing for security is performed on the
file manager, NASD drives perform most of the processing to enforce
the security policy. Specifically, the cryptographic functions and
the enforcement of manager decisions are implemented at the drive,
while policy decisions are made in the file manager. Because clients
directly request access to data in their files, a NASD drive must
have sufficient metadata to map and authorize the request to disk
sectors. Authorization, in the form of a time-limited capability
applicable to the file's map and contents, should be provided by the
file manager to protect higher-level file systems' control over
storage access policy. The storage mapping metadata, however, could
be provided dynamically [VanMeter96a] by the file manager or could
be maintained by the drive. While the latter approach asks distrib-
uted file system authors to surrender detailed control over the
layout of the files they create, it enables smart drives to better
exploit detailed knowledge of their own resources to optimize data
layout, read-ahead, and cache management [deJonge93, Patterson95,
Golding95]. This is precisely the type of value-added opportunity
that nimble storage vendors can exploit for market and customer
advantage. With mapping metadata at the drive controlling the
layout of files, a NASD drive exports a namespace of file-like
objects. Because control of naming is more appropriate to the
higher-level file system, pathnames are not understood at the drive,
and pathname resolution is split between the file manager and
client. While a single drive object will suffice to represent a
simple client file, multiple objects may be logically linked by the
file system into one client file. Such an interface provides support
for banks of striped files [Hartman93], Macintosh-style resource
forks, or logically-contiguous chunks of complex files [deJonge93].

Figure 3: Network-attached secure disks (NASD) are designed to offload more of the file system's simple and
performance-critical operations. For example, in one potential protocol a client, prior to reading a file, requests
access to that file from the file manager (1), which delivers a capability to the authorized client (2). So equipped,
the client may make repeated accesses to different regions of the file (3, 4) without contacting the file manager
again unless the file manager chooses to force reauthorization by revoking the capability (5).

As an example of a possible NASD access sequence, consider 
a file read operation depicted in Figure 3. Before issuing its first 
read of a file, the client authenticates itself with the file manager 
and requests access to the file. If access is granted, the client
receives the network location of the NASD drive containing the 
object and a time-limited capability to access the object and for 
establishing a secure communications channel with the drive. 
After this point, the client may directly request access to data on 
NASD drives, using the appropriate capability [Gobioff96].
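
The access sequence above can be illustrated with a small sketch. It assumes, purely for illustration, that a capability is a message authentication code over the object name, the granted rights, and an expiry time, keyed with a secret shared by the file manager and the drive. The names `issue_capability`, `drive_accepts`, and `DRIVE_KEY` are hypothetical, and the text does not specify this construction.

```python
import hashlib
import hmac

DRIVE_KEY = b"secret shared by file manager and drive"   # hypothetical key

def _mac(object_id: str, rights: str, expires_at: float) -> str:
    msg = f"{object_id}|{rights}|{expires_at}".encode()
    return hmac.new(DRIVE_KEY, msg, hashlib.sha256).hexdigest()

def issue_capability(object_id: str, rights: str, expires_at: float) -> dict:
    """File manager: grant time-limited access to one drive object."""
    return {"object_id": object_id, "rights": rights,
            "expires_at": expires_at,
            "mac": _mac(object_id, rights, expires_at)}

def drive_accepts(cap: dict, object_id: str, op: str, now: float) -> bool:
    """Drive: enforce the manager's decision without contacting the manager."""
    expected = _mac(cap["object_id"], cap["rights"], cap["expires_at"])
    return (hmac.compare_digest(expected, cap["mac"])   # not forged or altered
            and cap["object_id"] == object_id           # right object
            and op in cap["rights"]                     # operation was granted
            and now < cap["expires_at"])                # still within lifetime

cap = issue_capability("obj-17", "r", expires_at=1000.0)
print(drive_accepts(cap, "obj-17", "r", now=10.0))     # read within lifetime
print(drive_accepts(cap, "obj-17", "w", now=10.0))     # write was not granted
print(drive_accepts(cap, "obj-17", "r", now=2000.0))   # capability expired
```

In this sketch the policy decision (which rights, for how long) is made once at the file manager, while every later check runs at the drive, which mirrors the split between policy and enforcement described in the text.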

In addition to offloading file read operations from the distrib- 
uted file manager, later sections will show that NASD should also 
offload file writes and attribute reads to the drive. High-level file
system policies, such as access control and cache consistency, 
however, remain the purview of the file manager. These policies 
are enforced by NASD drives according to the capabilities con- 
trolled by the file manager. 

3.5 Summary 

This taxonomy, summarized in Table 1, splits into two
classes: SAD and SID offer a specific distributed file system
while NetSCSI and NASD offer enhanced storage interfaces. The
difference between SID and NASD merits further consideration.
Many of the optimizations we propose for NASD, such as short-
ened data paths and specialized protocol processing, can also be
implemented in a SID architecture. However, SID binds storage to
a particular distributed file system, requires higher-level (or
multiple-SID) file management to offer network striped files and,
by not evolving the drive interface, inhibits the independent
development of drive technology. For the rest of this paper, we
focus on SAD, NetSCSI, and NASD and present a coarse-grained
estimate of the potential benefit of network-attached storage. The
results suggest that by exploiting the processing power available
in next-generation storage devices, computation required from the
file manager machines can be dramatically reduced, enabling the
per-byte cost of distributed file service to be reduced.

Table 1: [...] SAD and SID require the file manager (file server) to handle each byte of data,
but SID allows specialization of the hardware and software to file service.
NetSCSI allows direct transfers to clients, but requires file manager
interaction on each operation to manage metadata.

* * * 

4 Analysis of File System Workload 

To develop an understanding of performance parameters crit- 
ical to network-attached storage, we performed a series of mea- 
surements to (1) characterize the behavior and cost of AFS and 
NFS distributed file server functionality; and (2) identify and sub- 
set busy periods during which server load is limiting. 

4.1 Trace Data 

Our data is taken from NFS and AFS file system traces sum-
marized in Table 2. The NFS trace [Dahlin94] records the activity
of an Auspex file server supporting 231 client machines over a one
week period at the University of California at Berkeley (see
footnote 2). The AFS trace records the activity of our laboratory's
Sparcstation 20 AFS server supporting 250 client machines over a
one month period (see footnote 3).





                              NFS trace          AFS trace
Number of client machines     231                250
Total number of requests      6,676,479          1,615,540
Read data transferred (GB)    8.1                2.9
Write data transferred (GB)   2.0                1.6
Trace period                  9/20/93-9/24/93    9/9/96-10/3/96
                              40 hours           435 hours (footnote 3)

Table 2: [...] The NFS trace was collected in a study performed at the
University of California at Berkeley. The AFS trace was collected by
logging requests at the AFS file server in our laboratory.

2. Some attribute reads were removed from the NFS trace by the Berkeley
researchers based on a heuristic for eliminating excessive cache consis-
tency traffic. Because this change is pessimistic to our proposed architec-
ture, we choose to continue to use these traces, already familiar to the
community, rather than collect new traces.

3. The trace covers three periods of activity: 9/9-10, 9/13-15, and 9/20-10/3.

Table 3(a) - NFS Trace Operations

Trace Record   NFS Operations               Description
AttrRead       getattr                      Get metadata information
AttrWrite      setattr                      Update metadata information
BlockRead      read                         Get data from server
BlockWrite     write                        Send data to server
DirRead        lookup, readdir              [...]
DirReadWrite   create, mkdir, rename, etc.  [...]
DeleteWrite    unlink, rmdir                [...]

[The Percent and % of Cycles columns are illegible in the source; a single % of Cycles value, 35.5, survives without its row.]

Table 3(b) - NFS Cost Measurements

Data Size (bytes)   Read Cycles (thousands)   Write Cycles (thousands)
1                   54                        117
1K                  61                        -
2K                  68                        -
4K                  78                        148
8,000               100                       199

Table 3(c)

Operation              Cycles (thousands)
getattr                33
setattr                64
lookup                 50
readdir (1 entry)      63
readdir (40 entries)   105

Table 3(d)

Operation            Cycles (thousands)
create               81
unlink (last link)   135

Table 3: [...] DEC [...]/400 [...] MB of memory. [...] produced [...] buffer cache [...] discarded [...]

Both the NFS and AFS traces document each client request
with an arrival timestamp, a unique client host id, and an indica-
tion of the request type. The AFS trace records the exact type of
primitive AFS file system request and also includes a response
timestamp. The NFS trace only records the general class of the
issued request, which leaves some ambiguity in determining
exactly which primitive NFS requests were issued (e.g., a request
recorded as a directory read may have been either a lookup or
readdir request).

The original NFS trace is dominated by overnight backup 
activity. Since users are mostly insensitive to backup performance 
and this is not a major concern of this study, we exclude this activ- 
ity by only including requests timestamped Monday through Fri- 
day between 9am and 5pm in our data set. The AFS trace does not 
include any backup activity because AFS backups are handled by a
separate task on the file server machine. 

4.2 Cost of NFS and AFS Operations 

Our trace data captures the types and relative frequencies of
client requests but does not include the amount of CPU work per-
formed by the file server in handling each request. To estimate this
cost, we measured NFS and AFS server code paths on Digital
Equipment Alpha workstations. Specifically, we used the ATOM
binary annotation tool [Srivastava94] and the Alpha's on-chip
cycle counters to identify the code paths traversed and measure
the work required for each type of primitive file system operation.
To minimize measurement overhead and improve accuracy,
cost measurements were taken in two steps. First, we used ATOM
to annotate the entry and exit points of each procedure and issued
specific requests, producing a dynamic call graph for each primi-
tive operation. Then we re-annotated the server routines, starting at
packet arrival and ending at response packet dispatch, limiting
annotation to the critical components of each operation's code
path. For each operation, file system requests were repeatedly
issued [...] server, generating traces that
recorded code-path execution times. Measurements were repeated
for a range of request sizes where appropriate and all the measure-
ments are summarized in Table 3(b,c,d) and Table 4(b,c,d).

4.3 Relative Importance of NFS and AFS Operations 

Table 3(a) and Table 4(a) report the frequency distribution of
various server operations for the traces. Each table describes the
types of primitive operations and reports their frequencies and the
total number of occurrences in the trace. This data shows that
attribute read requests (AttrRead, FetchStatus, BulkStatus) are the
most frequently executed operations. While frequency statistics
emphasize attribute operations, the cycle count data indicate that
[...] place a significantly larger [...] burden on [...].

To [...] the relative importance of various primitive opera-
tions in the total workload applied to a file server, we estimate the
total amount of work performed by a server per request type [...]
estimate the total [...] per operation type by multiplying the
per-type count of occurrences by the measured average per-type
cycle count. Since the NFS trace groups certain operation types
together ([...] column 2 of Table 3(a)), we use a representative
member of each group to perform our NFS calculations. The rela-
tive importance of each operation can be deduced from the per-
centage of the server load attributable to each type, as shown in the
[...] Table 3(a) and Table 4(a). These estimates show
that data-moving operations contribute 51% of the NFS workload
[...] workload. Because these [...] are far
short of 100%, the performance gained by directly moving data
between clients and disks may be limited [Drapeau94]. As the next
subsection shows, this limits the benefit of NetSCSI for offloading
file manager workload and motivates the design of a NASD drive.
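
The work estimate described above (per-type count of occurrences multiplied by the measured per-type cycle count) can be sketched with the size-independent AFS operations, using request counts from Table 4(a) and cycle costs from Table 4(d). This is an illustration rather than the authors' tooling. The data-moving operations and BulkStatus are left out because their cost depends on request size, so the percentages printed here are shares of this subtotal and are not the "% of Cycles" column.

```python
# Request counts (thousands) from Table 4(a).
counts = {"FetchStatus": 1052.2, "StoreStatus": 40.4, "CreateFile": 27.0,
          "Rename": 10.4, "Others": 81.6}
# Cycles per request (thousands) from Table 4(d).
cycles = {"FetchStatus": 128, "StoreStatus": 189, "CreateFile": 307,
          "Rename": 285, "Others": 227}

# thousands of requests x thousands of cycles = millions of cycles
work = {op: counts[op] * cycles[op] for op in counts}
subtotal = sum(work.values())

for op, w in sorted(work.items(), key=lambda kv: -kv[1]):
    share = 100 * w / subtotal
    print(f"{op:12s} {w:10.1f} Mcycles  {share:5.1f}% of subtotal")
```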
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Table 4(a) - AFS Trace Operations 



AFS Operation   Description                                  Percent   Quantity      % of Cycles
                                                                       (thousands)
FetchStatus     Get metadata information                     65.1      1052.2        39.3
BulkStatus      Perform a group of FetchStatus operations    5.8       93.4          10.9
StoreStatus     Update metadata information                  2.5       40.4          2.2
FetchData       Get data from server                         13.9      224.0         27.9
StoreData       Send data to server                          3.8       61.5          8.5
CreateFile      Create a new file                            1.7       27.0          2.4
Rename          Rename a file/directory                      0.6       10.4          0.9
RemoveFile      Remove a file                                1.5       25.0          2.4
Others          ACL manipulation, symbolic links, directory  5.0       81.6          5.4
                creation/deletion, lock management etc.



Table 4(b) - AFS Cost Measurements 



Operation    Cycles according to Size of Operation (thousands)
             0     1     512   1K    2K    4K    8K    16K   32K   64K     1M
FetchData    -     179   192   191   204   270   330   439   788   1,544   -
StoreData    259   -     291   303   363   371   410   578   750   1,242   16,752
RemoveFile   -     331   396   396   410   411   412   414   429   452     1,053



Table 4(c)

BulkStatus Size       Cycles
(directory entries)   (thousands)
1                     151
3                     178
10                    324
20                    578
25                    662

Table 4(d)

Operation     Cycles (thousands)
FetchStatus   128
StoreStatus   189
CreateFile    307
Rename        285
Others        227



Table 4: Distribution and average costs of AFS operations. Cycle counts were taken on a DEC 3000/500 (150MHz, 128 MB of memory, 
Digital UNIX 3.2c) running an ATOMized AFS version 3.4 server. ATOM tracing overhead was negligible compared to other system-level effects on the 
server. The server's caches were warmed and trials that produced misses in the local file system cache were discarded. The number of cycles for "Others" 
was estimated as the average of the four size-independent operations that were measured individually (FetchStatus, StoreStatus, CreateFile, Rename).
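
As a quick check of the "Others" estimate described in the caption, the following sketch averages the four individually measured values from Table 4(d). The variable names are illustrative.

```python
from statistics import mean

# Thousands of cycles per request, from Table 4(d).
measured = {"FetchStatus": 128, "StoreStatus": 189,
            "CreateFile": 307, "Rename": 285}

others = mean(measured.values())
print(others)          # average of the four measured operations
print(round(others))   # value after rounding to a whole number of kilocycles
```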



4.4 Busy Client-Minutes 

A distributed file system scales if an increase in aggregate cli- 
ent demand, and the corresponding increase in storage capacity 
and bandwidth, does not result in a decrease in client-observed
performance. In a previous study [Riedel96], we examined the cor- 
relation between hourly averages of client response times, network 
round-trip times and server load. Users may be satisfied with their 
response times when servers are idle, but experience periods of 
dramatically longer response times which correlate with periods of 
high server load. Since client dissatisfaction is strongly determined 
by prolonged periods of considerably higher than average response 
time, this study focuses on server performance during such bursts 
of high load. For such a burst to have client impact, it must persist 
for a sufficiently long time. In this paper we have chosen to exam- 
ine load during one minute intervals — long enough for interactive 
users to identify a slowdown, but not so long that poor perfor- 
mance during bursts is hidden by overall averages. Our previous 
study also observed that periods of high server load may exhibit a 
different distribution of request types — data movement is more 
prevalent. In order to capture the distribution of operations during 
these critical bursty periods, we restrict the rest of our analysis to 
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Figure 4: Cumulative distribution of estimated server work for NFS and 
AFS intervals. The graph shows that 98% of NFS client-minutes and 95%
of AFS client-minutes require less than 0.1 seconds of estimated server
work.

                      NFS                      AFS
                      Number      % of total   Number      % of total
Busy client-minutes   4,636       2            2,809       5
Client machines       135         58           78          31
Requests              3,730,031   56           1,199,419   74
Read data (GB)        4.8         59           2.8         96
Write data (GB)       1.7         84           1.3         84

Table 5: Statistics for the top 2% of NFS client-minutes, and top 5%
of AFS client-minutes, as measured by estimated work.



the busiest one-minute intervals as measured by the amount of 
work detailed in Section 4.3. 

Based on this metric and the data in Figure 4, we chose to 
restrict analysis to client-minutes (single minutes of a single cli- 
ent's activity) that consume more than 0.1 seconds of server CPU 
(top 2% of NFS and top 5% of AFS client-minutes). Table 5 sum-
marizes these busy client-minutes.
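
The selection of busy client-minutes can be sketched as follows: group requests by client and by minute, convert the estimated cycles into CPU seconds, and keep the groups above the 0.1 second threshold. The trace records below are made up for illustration. The 150 MHz clock rate is taken from the Table 4 caption and is assumed here to be the right conversion factor, and `busy_client_minutes` is a hypothetical helper name.

```python
from collections import defaultdict

CLOCK_HZ = 150e6      # assumption: the 150 MHz machine named in the Table 4 caption
THRESHOLD_S = 0.1     # busy threshold on estimated server CPU seconds

def busy_client_minutes(requests, cycles_per_op):
    """requests: iterable of (timestamp_seconds, client_id, op_name)."""
    work = defaultdict(float)
    for ts, client, op in requests:
        minute = int(ts // 60)                       # one-minute interval index
        work[(client, minute)] += cycles_per_op[op] / CLOCK_HZ
    return {key: secs for key, secs in work.items() if secs > THRESHOLD_S}

cycles = {"FetchStatus": 128_000, "StoreStatus": 189_000}   # from Table 4(d)

# Hypothetical trace: clientA issues 200 requests in one minute, clientB one.
trace = [(i * 0.25, "clientA", "FetchStatus") for i in range(200)]
trace.append((5.0, "clientB", "StoreStatus"))

print(busy_client_minutes(trace, cycles))
```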

5 Analytic Model 

When a distributed file system is ported to NetSCSI or NASD 
environments, the disposition of client requests is adjusted accord- 
ing to the goals described in Section 3. The principal benefit we 
expect for an existing file system such as NFS or AFS is a more 
cost-effective scaling of throughput by a reduction in the file man- 
ager load. In this section, we develop a simple estimate of this 
scaling. Following the work estimates of Section 4.3, where total 
file manager work is estimated as the sum of operation costs 
weighted by the frequency of each operation, we derive estimates 
of the NetSCSI and NASD file manager work done by NFS and 
AFS operations by approximating these costs with SAD operations 
which accomplish similar amounts of work, as reported in Table 6. 
These estimates are only coarse approximations, but provide a rea- 
sonable estimate of the potential benefit of network-attached stor- 
age over SAD in terms of file manager scaling. 

In the NetSCSI model, the only change from SAD is that the 
read/write datapath avoids the file manager. However, each read or 
write request must still be authorized and translated to NetSCSI 
block addresses. As we see in Table 7, this severely limits the scal- 
ability of NetSCSI even though we optimistically model the man- 
ager cost of read or write as a simple attribute read in SAD. 
Specifically, this model estimates file manager work with
NetSCSI to be at least two-thirds and three-quarters as much as
with SAD during busy NFS and AFS client-minutes, respectively.

In our NASD model, all read operations, including attribute 
and directory reads, are sent directly to the NASD drive. We fur- 
ther assume that NFS clients in NASD systems replace directory 
lookup operations with NASD (directory) object reads and execute 
the lookup locally. The data in file writes are also sent directly to 
the NASD drive. However, in order to support AFS consistency
semantics, data writes generate an additional request to the file 
manager. This request, which would allow the AFS file manager to 
perform the appropriate consistency maintenance (e.g. breaking 
callbacks), is estimated to require the same work as an attribute 
read request. NFS, which has a weaker consistency model, does 
not require this additional request. For attribute and directory
writes, we assume that clients must send their requests to the file 
manager. To estimate the file manager's pre-authorization and 
capability setup work prior to any access, we introduced a NASD 
open request which we emulate with an attribute read operation. 
Since NASD capabilities are valid for a limited time (twenty-four 
hours in this model), unless revoked by a change in access rights 
(an operation that is extremely rare in our traces), transforming the 
traces in this way adds one additional operation when a file is first 
referenced on a given day. Finally, remove operations, whose 
deallocation work is done by the NASD drive, require file man- 
ager work comparable to the removal of an empty file. 

For AFS, Table 7 shows that NASD systems may reduce file 
manager workload during busy client-minutes by a factor of two 
over NetSCSI systems and a factor of three over SAD systems. For 
NFS, where directory and attribute reads dominate the workload, 
file managers using NASD drives may benefit from a factor of
fourteen decrease in file manager load over SAD systems. 
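
A rough sketch of this analytic model for NFS follows: each trace record is translated into the file manager operation listed in Table 6, and the manager's work is the sum of operation costs weighted by frequency. The operation counts are invented for illustration and are not taken from the traces. Two further assumptions are made here: reads and writes are costed at the 8,000-byte row of Table 3(b), and remove is costed at the unlink figure of Table 3(d) in all three models, because no separate zero-byte figure is listed.

```python
# Cost per file manager operation, in thousands of cycles (Table 3).
COST = {"getattr": 33, "setattr": 64, "lookup": 50, "create": 81,
        "remove": 135, "read": 100, "write": 199}

# Table 6 (NFS): operation the file manager performs per trace record.
# None means the request goes to the drive and costs the manager nothing.
MODEL = {
    "SAD": {"AttrRead": "getattr", "AttrWrite": "setattr",
            "BlockRead": "read", "BlockWrite": "write",
            "DirRead": "lookup", "DirRW": "create",
            "DeleteWrite": "remove", "NasdOpen": None},
    "NetSCSI": {"AttrRead": "getattr", "AttrWrite": "setattr",
                "BlockRead": "getattr", "BlockWrite": "getattr",
                "DirRead": "lookup", "DirRW": "create",
                "DeleteWrite": "remove", "NasdOpen": None},
    "NASD": {"AttrRead": None, "AttrWrite": "setattr",
             "BlockRead": None, "BlockWrite": None,
             "DirRead": None, "DirRW": "create",
             "DeleteWrite": "remove", "NasdOpen": "getattr"},
}

def manager_work(counts, model):
    """Sum of operation costs weighted by how often each record occurs."""
    total = 0
    for record, n in counts.items():
        op = MODEL[model][record]
        if op is not None:
            total += n * COST[op]
    return total

# Hypothetical mix of trace records for one busy interval.
counts = {"AttrRead": 400, "AttrWrite": 10, "BlockRead": 300,
          "BlockWrite": 100, "DirRead": 150, "DirRW": 20,
          "DeleteWrite": 10, "NasdOpen": 10}

sad = manager_work(counts, "SAD")
for model in ("SAD", "NetSCSI", "NASD"):
    w = manager_work(counts, model)
    print(f"{model:8s} {w:7d} kcycles  {w / sad:.2f} of SAD")
```

The ratios printed depend entirely on the invented mix of records, so they illustrate the mechanics of the model and should not be read as a reproduction of Table 7.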

6 Replay Experiment 

The analytic model neglects several factors. Particularly con- 
cerning is its inability to account for system-level activity (e.g. 
page faults, scheduler activity, thread overhead, queueing effects) 
that could significantly impact the behavior and performance of 
NetSCSI and NASD systems. Given our goal of justifying further 
implementation studies, we chose to explore system overheads and 
interactions by replaying the traces, modified according to Table 6 
to coarsely model the work of a NetSCSI or NASD server, against 
existing SAD implementations. This experiment allows us to mea- 
sure expected file manager load under SAD, NetSCSI and NASD, 



NFS trace     SAD            NetSCSI        NASD
AttrRead      getattr        getattr        -
AttrWrite     setattr        setattr        setattr
BlockRead     read(size)     getattr        -
BlockWrite    write(size)    getattr        -
DirRead       lookup         lookup         -
DirRW         create         create         create
DeleteWrite   remove(size)   remove(size)   remove(0 byte)
NasdOpen      -              -              getattr

AFS trace     SAD                NetSCSI            NASD
FetchStatus   FetchStatus        FetchStatus        -
StoreStatus   StoreStatus        StoreStatus        StoreStatus
FetchData     FetchData(size)    FetchStatus        -
StoreData     StoreData(size)    FetchStatus        FetchStatus
RemoveFile    RemoveFile(size)   RemoveFile(size)   RemoveFile(0 byte)
BulkStatus    BulkStatus         BulkStatus         -
NasdOpen      -                  -                  FetchStatus

Table 6: Description of what the operations in the filesystem traces translate to in the SAD, NetSCSI and NASD models. The tables list the operations as
recorded in the trace and the corresponding RPC request issued by a client during replay for each of SAD, NetSCSI and NASD [...] used in the analytic
calculations to estimate server load in each model. The operations not listed for AFS are the same across SAD, NetSCSI and NASD. The last row in each
table corresponds to the NASD open operation, which we added to the traces.

Table 7: [...] or NASD operation cycle count and the SAD total cycle count.



capturing system-level activities not accounted for in the analytic 
model and more accurately estimating the increase in scalability 
possible with NetSCSI- and NASD-based systems. 

6.1 Experiment 

The replay environment, as illustrated in Figure 5, is com- 
posed of a single file manager (i.e. an NFS or AFS server) and sev- 
eral host workstations. We refer to these host workstations, used to 
replay (modified) trace requests to the file manager, as replay 
hosts. Each replay host merges several client-minute traces which 
it then replays using an open-loop request-issue model where mul- 
tiple threads replay each of the requests according to the issue 
timestamps. When the total load applied is well under the server's 
capability, the timing of operations approximates the original 
traces and the replay completes in about one minute. However, as 
the number of client-minutes grows, the work required of the 
server exceeds its capability and responses may be delayed so long 
that all client threads are blocked when a timestamp requires a
request to be replayed. When such deadlines are missed often, the 
system degrades to a closed-loop experiment and the runtime can 
be significantly longer than one minute. During replay, file man- 
ager CPU load is measured by recording time spent in the kernel 
idle loop, and subtracting this from the total duration of the replay. 
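
The way an open-loop replay degrades into a closed-loop one can be illustrated with a small deterministic simulation. This is a sketch under simplifying assumptions, not the replay tool itself: the server handles one request at a time with a fixed service time, each replay thread blocks until its request completes, and the name `simulate_replay` and all parameters are hypothetical.

```python
import heapq

def simulate_replay(timestamps, service_time, n_threads):
    """Return (runtime, missed) for replaying timestamps against one server."""
    thread_free = [0.0] * n_threads    # min-heap: when each thread becomes idle
    heapq.heapify(thread_free)
    server_free = 0.0                  # when the server can start a new request
    missed = 0                         # requests issued later than their timestamp
    finish = 0.0
    for ts in sorted(timestamps):
        free_at = heapq.heappop(thread_free)
        issue = max(ts, free_at)       # no idle thread means the deadline slips
        if issue > ts:
            missed += 1
        start = max(issue, server_free)
        server_free = start + service_time
        heapq.heappush(thread_free, server_free)   # thread blocks until the reply
        finish = max(finish, server_free)
    return finish, missed

minute = [i * 0.1 for i in range(600)]   # 600 requests spread over one minute
print(simulate_replay(minute, service_time=0.01, n_threads=4))  # light load
print(simulate_replay(minute, service_time=0.2, n_threads=4))   # overloaded
```

Under light load no deadline is missed and the run finishes within the minute. Once the server needs more than sixty seconds of service time in total, threads are still blocked when later timestamps come due, and the runtime stretches past one minute.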

To measure file server load as a function of increasing client 
demand, we varied the number of client-minutes replayed simulta- 
neously. To replay m client-minutes, we randomly select m client- 



minutes from the pool of busy client-minutes described in Section 
4.4. As indicated by the long tails of the distributions in Figure 4, 
different randomly selected sets of m client-minutes may have 



widely varying load. Therefore, we construct p (=5) samples for
each set of m client-minutes, and report the mean and 90% confi-
dence intervals. For comparison, Figure 6 reports the mean and
90% confidence intervals in estimated file manager work accord-
ing to the analytic model of Section 5 applied to the client-minutes
selected for replay.

Figure 5: Setup of the trace-driven replay experiments. Multi-threaded
processes on each replay host submit requests from a set of client-minutes
to the file server emulating the expected traffic in the case of SAD,
NetSCSI, and NASD.

Figure 7: Mean and 90% confidence intervals of measured load for NFS and AFS replay experiments for five samples of m client-minutes.
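
The mean and 90% confidence interval over p = 5 samples can be computed as in the sketch below. The text does not say how the intervals were derived, so a Student-t interval is assumed here, and the load values are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

T_90_DF4 = 2.132   # two-sided 90% Student-t critical value, 4 degrees of freedom

def mean_ci90(samples):
    """Mean and 90% confidence half-width for exactly five samples."""
    if len(samples) != 5:
        raise ValueError("the critical value above applies only to p = 5")
    half_width = T_90_DF4 * stdev(samples) / sqrt(len(samples))
    return mean(samples), half_width

loads = [50.0, 52.0, 48.0, 51.0, 49.0]   # hypothetical CPU seconds per sample
m, h = mean_ci90(loads)
print(f"{m:.1f} +/- {h:.2f} CPU seconds")
```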

6.2 Results 

Comparing Figure 6 and Figure 7, we see that replay experi- 
ences significantly more CPU work — work that was overlooked 
by the analytic model. At 90 client-minutes, the analytic 
NFS/NASD model predicts less than 30 CPU seconds while the 
replay model consumes over 60 CPU seconds. In spite of these dif- 
ferences, the replay results and the analytic data display a strong 
correlation in the relative performance of SAD, NetSCSI and 
NASD. For example, at 90 NFS client-minutes, both show a 40% 
difference in load between SAD and NetSCSI, and a 90% differ-
ence between SAD and NASD, with similar correspondence in the
AFS case. The similarities between the results of Section 5 and 6 
suggest that, provided implementations on NetSCSI and NASD 
have operation costs similar to those in Table 6, NetSCSI provides 
limited benefit to existing distributed file systems (a factor of 
about 1.5 improvement). In contrast, NASD promises substantially 
lower file manager costs per client (factors of up to ten for NFS 
and up to five for AFS). 

From Figure 7, it appears that each data point required less 
than 60 seconds of file manager CPU, the amount available in 
these one minute replay experiments. However, when CPU satura- 
tion of the manager slows the generation of trace events, it causes 
the replay to run longer than 60 seconds. For NFS, replay overrun 
occurs with 80 or more client-minutes in SAD and NetSCSI and 
does not occur in NASD. For AFS, this overrun occurs with 25 or 
more client-minutes in SAD, 50 or more client-minutes in 
NetSCSI and 120 or more client-minutes in NASD. AFS suffers 
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more from this effect because its file manager is user-level with 
only one kernel thread — the entire file manager blocks on every 
disk access (NFS is in-kernel and the file manager has sixteen
threads at its disposal). 
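
The overrun thresholds reported above can be collected into a small lookup table. This is only a restatement of the numbers in the text; the table and function names are illustrative, and "no overrun" for NFS on NASD should be read as "none reported for the workloads tested".

```python
# Client-minute counts at or above which replay ran longer than 60 seconds,
# as reported in the text. None means no overrun was reported.
OVERRUN_THRESHOLD = {
    ("NFS", "SAD"): 80,
    ("NFS", "NetSCSI"): 80,
    ("NFS", "NASD"): None,
    ("AFS", "SAD"): 25,
    ("AFS", "NetSCSI"): 50,
    ("AFS", "NASD"): 120,
}


def replay_overran(file_system, architecture, client_minutes):
    """Return True if the text reports overrun at this load level."""
    threshold = OVERRUN_THRESHOLD[(file_system, architecture)]
    return threshold is not None and client_minutes >= threshold


print(replay_overran("AFS", "SAD", 30))
print(replay_overran("NFS", "NASD", 200))
```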

6.3 Cache Effects

A limitation of the results in Figure 7 relates to our handling 
of file manager cache state. The use of samples constructed from
random busy one-minute intervals makes it difficult to determine
what constitutes a realistic initial state for the data and metadata
caches. Because a cache miss induces more file manager work
than a cache hit, biases which increase misses also increase work.
Further, because the file manager work in SAD, NetSCSI and
NASD differs, with far fewer cache accesses done by NetSCSI and
NASD, it is reasonable to expect that SAD file manager work is 
over-estimated more by excess misses than the other cases. 

For this reason we should have run all workloads with warm 
caches, biasing in favor of SAD file managers. Unfortunately, our 
ability to control cache contents carefully was best when using 
cold caches (footnote 4). Therefore, to bound the bias against SAD, we ran a
simple experiment. For NFS's 10 and 20 client-minute workloads,
the entire set of data accessed during those client-minutes fits in
the file manager's buffer cache. This allowed us to perform the
NFS replay with an initially cold data cache, then repeat the same
replay without flushing the contents of the cache (thus starting
from an optimally-warmed cache, containing all the data which 
will be accessed in the second run). In this case, the CPU load on 
the file manager in SAD decreased by 10-18%, which we take as 
the upper bound on the effect of warm versus cold data caches on 
the CPU load of a SAD file manager. To restate, Figure 7 may 
falsely penalize SAD performance by up to 18% because of cold 
data caches during replay. 
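
One way to read this bound is as a range on the measured improvement factors. The sketch below assumes that warm caches would lower only the SAD load, by at most 18%, and leave the other architecture's load unchanged; this is a conservative simplification rather than something the experiments measured, and the names are illustrative.

```python
# Upper bound on the SAD load reduction from warm data caches, from the text.
COLD_CACHE_PENALTY = 0.18


def warm_cache_factor_bounds(measured_factor, max_penalty=COLD_CACHE_PENALTY):
    """Return (lower, upper) bounds on the SAD-relative improvement factor.

    The upper bound is the factor measured with cold caches. The lower
    bound assumes SAD load would have been max_penalty lower when warm,
    while the compared architecture's load stays the same.
    """
    if not 0.0 <= max_penalty < 1.0:
        raise ValueError("max_penalty must be in [0, 1)")
    return measured_factor * (1.0 - max_penalty), measured_factor


# A measured factor of ten would fall to no less than about 8.2.
low, high = warm_cache_factor_bounds(10.0)
print(f"{low:.1f} to {high:.1f}")
```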

7 Conclusion and Future Directions 

Network-attached storage, by enabling direct transfers 
between client and storage, can substantially increase distributed 
file system scalability while simultaneously enabling striped stor- 
age to satisfy the bursty, high-bandwidth demands of the increas- 
ingly high-performance clients populating local area networks. 
This promises benefits in a wide enough range of storage markets
to make commodity network-attached storage feasible.

In this paper we have presented a simple classification of 
storage architectures for distributed file systems with four models. 
The traditional, server-attached disk (SAD) model is our base case. 
Server-integrated disk (SID) systems, including specialized NFS
server products, are architecturally identical, but have hardware 
and software designed specifically for file service. We do not 
emphasize this model because it binds storage products to a partic- 
ular choice of distributed file system. 

The remaining two storage models exploit the potential of 
network-attached storage. Network SCSI (NetSCSI) drives are 
very similar to current SCSI disks in that all file requests go 
through the file manager, but the resulting data transfers go 
directly between client and drive. This may reduce file manager 
workload during busy periods by about 30%. Different security 
models can be provided using NetSCSI depending on the crypto- 
graphic support provided in the drive. 



4 Even here our control was incomplete; NFS used a cold data cache, but a
warm metadata cache. AFS used cold data and metadata caches.



Network-attached secure disks (NASD) support storage 
semantics at a level between that of block-level protocols like 
SCSI and distributed file systems like NFS and AFS. The parti- 
tioning of file system functionality between NASD drive and file 
manager is optimized to reduce file manager load while maintain- 
ing system flexibility. To operate securely in the face of this parti- 
tion, NASD drives rely on cryptographic support for security and 
authorization. Our studies show that, by offloading data read and 
write and attribute and directory read operations, distributed file 
system server load during busy periods may be reduced by a factor 
of fourteen (NFS) and three (AFS) in the analytic model, and up to 
ten (NFS) and five (AFS) in the replay experiments. 

Our analysis focuses on describing the distinct methods of 
organizing storage architecture and estimating the potential
improvement each promises for existing distributed file systems. 
With the positive results given here, our future directions are clear. 
We plan to demonstrate that distributed file systems can be imple- 
mented around network-attached storage, preserving powerful 
security models and yielding considerable scalability and client 
performance advantages. Along this path, many open questions 
remain. Our NASD model, in particular, expects a disk drive to be 
capable of computation not normally associated with cost-sensi- 
tive commodity peripherals; drive micro-architectures and soft- 
ware structures must be developed and demonstrated. 

Further, NASD's out-of-datapath file manager does not natu- 
rally provide the server caching found in traditional systems which 
store-and-forward data through the server. We must evaluate the 
penalty of distributing the caches among storage. This penalty may 
be mitigated if storage objects are striped over drives because 
striping inherently eliminates hotspots [Livny87]. On the other 
hand, server caching is significantly less important to performance 
than client caching and becomes less important still with coopera-
tive caching in idle clients [Dahlin94, Feeley95] and aggressive
prefetching by clients [Patterson95, Cao95]. 

Finally, in the NASD models presented, we assume that cli- 
ents "open" files by contacting the distributed file system server 
one file at a time to set up the state needed for direct transfers to
and from storage and allow the file manager to handle consistency. 
A clear improvement, similar to the effect of client caching in 
AFS, might be provided by pre-authorization or group-authoriza-
tion schemes.
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